Personal Loan Bank Campaign¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Personal Loan campaign image AI Generated Image

Objectives¶

To predict whether a liability customer will buy a personal loan or not. Which variables are most significant. Which segment of customers should be targeted more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
  • Securities_Account: Does the customer have securities account with the bank?
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
  • Online: Do customers use internet banking facilities?
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)?

Setup¶

Basic Setup and Imports¶

In [122]:
#ignoring warnings
import warnings
warnings.filterwarnings("ignore")
In [123]:
# Libraries to help with reading and manipulating data
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# libaries to help with data visualization
import matplotlib.pyplot as plt       # matplotlib.pyplot plots data
%matplotlib inline
import seaborn as sns
pd.set_option('mode.chained_assignment', None)

#restrict float value to 3 decimal places
pd.set_option('display.float_format',lambda x: '%.3f' % x)
%matplotlib inline

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

from sklearn.linear_model import LogisticRegression

# To get diferent metric scores


# Library to split data
from sklearn.model_selection import train_test_split

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    #plot_confusion_matrix,
    precision_recall_curve,
    roc_curve,
)
In [124]:
#allow access to google drive
from google.colab import drive
drive.mount('/content/gdrive')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
In [125]:
#paths
datapath= "/content/gdrive/MyDrive/Projects/ML/Data/"

#load csv into panda
ploan_df = pd.read_csv(datapath+ 'Loan_Modelling.csv')
In [126]:
# copying data to another variable to avoid any changes to original data
ploancp_df = ploan_df.copy()

Function Definitions¶

In [127]:
#Function to show %
def autopct_format(values):
        def my_format(pct):
            total = sum(values)
            val = int(round(pct*total/100.0))
            return '{:.1f}%\n({v:d})'.format(pct, v=val)
        return my_format


# function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot


#function to show stacked barplots
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

### function to plot distributions wrt target

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()


# defining a function to compute different metrics to check performance of a classification model built using sklearn

def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics, based on the threshold specified, to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # predicting using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    pred = np.round(pred_thres)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf


# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix, based on the threshold specified, with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    y_pred = np.round(pred_thres)

    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Quick Dataset Overview¶

View Header and Tail

In [128]:
ploancp_df.head()
Out[128]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.600 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.500 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.000 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.700 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.000 2 0 0 0 0 0 1
In [129]:
ploancp_df.tail()
Out[129]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.900 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.400 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.300 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.500 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.800 1 0 0 0 0 1 1

Visualize the shape and data types of dataset

In [130]:
ploancp_df.shape
Out[130]:
(5000, 14)
In [131]:
print("The dataset has", ploancp_df.shape[0], "rows and",ploancp_df.shape[1],"columns.")
number_accounts = ploancp_df.shape[0]
The dataset has 5000 rows and 14 columns.
In [132]:
ploancp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

We can see that all the features are numerical, also that there are no NaN values. However notice that some features are better as Categorical/Object type than Numerical.

Summary

In [133]:
ploancp_df.describe().T
Out[133]:
count mean std min 25% 50% 75% max
ID 5000.000 2500.500 1443.520 1.000 1250.750 2500.500 3750.250 5000.000
Age 5000.000 45.338 11.463 23.000 35.000 45.000 55.000 67.000
Experience 5000.000 20.105 11.468 -3.000 10.000 20.000 30.000 43.000
Income 5000.000 73.774 46.034 8.000 39.000 64.000 98.000 224.000
ZIPCode 5000.000 93169.257 1759.455 90005.000 91911.000 93437.000 94608.000 96651.000
Family 5000.000 2.396 1.148 1.000 1.000 2.000 3.000 4.000
CCAvg 5000.000 1.938 1.748 0.000 0.700 1.500 2.500 10.000
Education 5000.000 1.881 0.840 1.000 1.000 2.000 3.000 3.000
Mortgage 5000.000 56.499 101.714 0.000 0.000 0.000 101.000 635.000
Personal_Loan 5000.000 0.096 0.295 0.000 0.000 0.000 0.000 1.000
Securities_Account 5000.000 0.104 0.306 0.000 0.000 0.000 0.000 1.000
CD_Account 5000.000 0.060 0.238 0.000 0.000 0.000 0.000 1.000
Online 5000.000 0.597 0.491 0.000 0.000 1.000 1.000 1.000
CreditCard 5000.000 0.294 0.456 0.000 0.000 0.000 1.000 1.000
  • ID is the same as index, we can drop this feature.
  • Age Average age is 45, and the range is from 23 to 67. The age range seems acceptable.
  • Experience (Years of Experience) Average is 20, and the range is from -3 to 43. Negatives values are an error. We must address this later.
  • Income Average is 73K and the range is from 8K to 224K. This range seems acceptable.
  • Zip Code, This is actually not numerical, but a categorical feature. This may be useful to know which ZIP codes lead to reduced risk customers.
  • Family, the Average number of members is 2.4 and the range is from 1 to 4 members. This range is acceptable.
  • CCAvg (Monthly Average spending on Credit Card) Average is $1,938, and the range is $0 to $10,000. This range is acceptable.
  • Education, This appears as a numerical feature, but it should be categorical. However, according to the feature description, there are only 3 levels of education: a) Undergraduate, b) Graduate, c) Advanced/Professional

I wonder if the dataset only contains customers that have college degrees, so is "Undergraduate" a customer that has completed college education? "Graduate" means someone that has a masters degree? How about "Advanced/Professional", does this refer to a PhD?

We will analyze this in more detail later.

  • Mortage, Personal_Loan, Securities_Account, CD_Account, Online, CreditCard are all boolean fields (or numerical 0 or 1).

Data Preparation¶

Lets do a quick first data preparation process. We can drop the Customer ID.

In [134]:
#Dropping ID
ploancp_df.drop('ID',axis=1,inplace=True)
ploancp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIPCode             5000 non-null   int64  
 4   Family              5000 non-null   int64  
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64  
 7   Mortgage            5000 non-null   int64  
 8   Personal_Loan       5000 non-null   int64  
 9   Securities_Account  5000 non-null   int64  
 10  CD_Account          5000 non-null   int64  
 11  Online              5000 non-null   int64  
 12  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(12)
memory usage: 507.9 KB

Univariate Analysis¶

Null Values¶

In [135]:
ploancp_df.isnull().values.any()
Out[135]:
False
In [136]:
ploancp_df.isnull().sum()
Out[136]:
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

There are no Null values in the dataset as we had seen earlier.

Change Feature Types

Change features that should be categorical objects: ZIPCode, Education, Personal_Loan Securities_Account, CD_Account, Online, CreditCard

In [137]:
cols_num2cat = ['ZIPCode','Education','Personal_Loan','Securities_Account','CD_Account','Online','CreditCard']
ploancp_df[cols_num2cat] = ploancp_df[cols_num2cat].astype('object')

Review Categorical and Object type features

In [138]:
numeric_cols = ploancp_df.select_dtypes(include=[np.number]).columns
cat_cols =ploancp_df.describe(include=["object"]).columns

print("Categorical Columns:     ", cat_cols)
print("Numeric Columns:    ", numeric_cols)
Categorical Columns:      Index(['ZIPCode', 'Education', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard'],
      dtype='object')
Numeric Columns:     Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage'], dtype='object')

Checking different levels in categorical data

In [139]:
# Checking value counts of categorical variables
for i in cat_cols:
    print("*" * 50)
    print("Unique values in", i, "are :")
    print(ploancp_df[i].value_counts())
    print("-" * 50)
    print("Unique values Percentages", i, "are :")
    print(ploancp_df[i].value_counts(1)*100)
**************************************************
Unique values in ZIPCode are :
94720    169
94305    127
95616    116
90095     71
93106     57
        ... 
96145      1
94087      1
91024      1
93077      1
94598      1
Name: ZIPCode, Length: 467, dtype: int64
--------------------------------------------------
Unique values Percentages ZIPCode are :
94720   3.380
94305   2.540
95616   2.320
90095   1.420
93106   1.140
         ... 
96145   0.020
94087   0.020
91024   0.020
93077   0.020
94598   0.020
Name: ZIPCode, Length: 467, dtype: float64
**************************************************
Unique values in Education are :
1    2096
3    1501
2    1403
Name: Education, dtype: int64
--------------------------------------------------
Unique values Percentages Education are :
1   41.920
3   30.020
2   28.060
Name: Education, dtype: float64
**************************************************
Unique values in Personal_Loan are :
0    4520
1     480
Name: Personal_Loan, dtype: int64
--------------------------------------------------
Unique values Percentages Personal_Loan are :
0   90.400
1    9.600
Name: Personal_Loan, dtype: float64
**************************************************
Unique values in Securities_Account are :
0    4478
1     522
Name: Securities_Account, dtype: int64
--------------------------------------------------
Unique values Percentages Securities_Account are :
0   89.560
1   10.440
Name: Securities_Account, dtype: float64
**************************************************
Unique values in CD_Account are :
0    4698
1     302
Name: CD_Account, dtype: int64
--------------------------------------------------
Unique values Percentages CD_Account are :
0   93.960
1    6.040
Name: CD_Account, dtype: float64
**************************************************
Unique values in Online are :
1    2984
0    2016
Name: Online, dtype: int64
--------------------------------------------------
Unique values Percentages Online are :
1   59.680
0   40.320
Name: Online, dtype: float64
**************************************************
Unique values in CreditCard are :
0    3530
1    1470
Name: CreditCard, dtype: int64
--------------------------------------------------
Unique values Percentages CreditCard are :
0   70.600
1   29.400
Name: CreditCard, dtype: float64
  • ZIPCode: Notice that 94720, 94305, 95616, 90095, 93106 represent 10% of all ZIP Codes in the dataset.

  • Education: 41.920% have a undergraduate degree, 30.020% a graduate degree, and 28.060% an advanced/professional degree. Are there any customers in the dataset that have no degrees? If the labels are right, all customers have at least a college degree.

  • Personal Loan: 90.4% of customers in the dataset do NOT have a personal loan and only 9.6% have a loan.

  • Securities Account: 10.44% of customers have a securities account.

  • CD Account: 6.040% of customers have a CD account.

  • Online Use: 59.680% of customers access their bank account online.

  • Credit Card: 29.400% of customers have a credit card with the bank.

Single Feature Distributions Plots¶

In [140]:
# creating histograms for numeric_cols
ploancp_df[numeric_cols].hist(figsize=(14, 14))
plt.show()

Notice that roughly Age and Experience are somewhat symmetrical. Income, Mortgage, CCAvg are skewed right. These all make sense.

In [141]:
#create barplots for categorical features

fig, axes = plt.subplots(2, 3, figsize=(15, 5), sharey=True)
fig.tight_layout(h_pad=2)
plt.subplots_adjust(top=0.89)
fig.suptitle('Distribution - Categorical Features')
sns.countplot(ax=axes[1, 1], data=ploancp_df, x='Personal_Loan')
sns.countplot(ax=axes[0, 2], data=ploancp_df, x='Securities_Account')
sns.countplot(ax=axes[0, 1], data=ploancp_df, x='CD_Account')
sns.countplot(ax=axes[0, 0], data=ploancp_df, x='Online')
sns.countplot(ax=axes[1, 0], data=ploancp_df, x='CreditCard')

plt.show()

We noticed that most customers access their account Online, a small percentage of customers have CD Account, a Securities Account or a Personal Loan, and some have a Credit card.

Visualizing these as pie charts may be more useful.

In [142]:
#visualization Single Features as pie plots with percentage of total

fig, axes = plt.subplots(1, 7, figsize=(18, 18), sharey=True)

data_fy=ploancp_df['Family'].value_counts()
data_ed=ploancp_df['Education'].value_counts()
data_pl=ploancp_df['Personal_Loan'].value_counts()
data_sa =ploancp_df['Securities_Account'].value_counts()
data_cd = ploancp_df['CD_Account'].value_counts()
data_online =ploancp_df['Online'].value_counts()
data_cc = ploancp_df['CreditCard'].value_counts()

plt.title("Distribution - Single Feature")

ax1 = plt.subplot2grid((1,7), (0,0))
plt.pie(data_fy,labels=data_fy.index, autopct=autopct_format(data_fy),colors=("g","r"))
plt.title('Family')

ax1 = plt.subplot2grid((1,7), (0,1))
plt.pie(data_ed,labels=data_ed.index, autopct=autopct_format(data_ed),colors=("g","r"))
plt.title('Education')

ax1 = plt.subplot2grid((1,7),(0,2))
plt.pie(data_pl,labels=data_pl.index, autopct=autopct_format(data_pl),colors=("g","r"))
plt.title('Personal Loan')

ax1 = plt.subplot2grid((1,7), (0, 3))
plt.pie(data_sa,labels=data_sa.index, autopct=autopct_format(data_sa),colors=("g","r"))
plt.title('Securities Account')

ax1 = plt.subplot2grid((1,7), (0, 4))
plt.pie(data_cd,labels=data_cd.index, autopct=autopct_format(data_cd),colors=("g","r"))
plt.title('CD Account')

ax1 = plt.subplot2grid((1,7), (0, 5))
plt.pie(data_online,labels=data_online.index, autopct=autopct_format(data_online),colors=("g","r"))
plt.title('Online')

ax1 = plt.subplot2grid((1,7), (0, 6))
plt.pie(data_cc,labels=data_cc.index, autopct=autopct_format(data_cc),colors=("g","r"))
plt.title('Credit Card')

plt.show()

The same features visualized as pie charts with amounts and percentages of the total. Notice that 30% of customers have a credit card with the bank, 40% access their accounts online, 10% have a securities account, 9.6% have a Personal Loan, and 6% have a CD account.

Analysis Feature: Age¶

Lets take a look at Age in more detail.

In [143]:
plt.figure(figsize=(6,4), dpi= 60)

plt.title("Age Distribution", fontsize=16)
sns.boxplot(data=ploancp_df,x='Age')
plt.show()

The age distribution doesn't have any outliers and the median is nicely distributed between Q1 and Q3.

In [144]:
#same but as barplots
labeled_barplot(ploancp_df, 'Age', perc=True, n=None)

Analysis Feature: Experience¶

We has observed earlier that the min Experience was a negative number. Lets take a look into that.

In [145]:
#How many Experience values are negative
num_exp_negative= ploancp_df.loc[ploancp_df['Experience']<0].value_counts().sum()
num_total_cust = ploan_df.shape[0]
print("The number of customers w/ apparent Negative experience is",num_exp_negative,"which represents",(num_exp_negative/num_total_cust)*100,"% of the total customers.")
The number of customers w/ apparent Negative experience is 52 which represents 1.04 % of the total customers.
In [146]:
#see the negative experience accounts
ploancp_df.loc[ploancp_df['Experience']<0]
Out[146]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
89 25 -1 113 94303 4 2.300 3 0 0 0 0 0 1
226 24 -1 39 94085 2 1.700 2 0 0 0 0 0 0
315 24 -2 51 90630 3 0.300 3 0 0 0 0 1 0
451 28 -2 48 94132 2 1.750 3 89 0 0 0 1 0
524 24 -1 75 93014 4 0.200 1 0 0 0 0 1 0
536 25 -1 43 92173 3 2.400 2 176 0 0 0 1 0
540 25 -1 109 94010 4 2.300 3 314 0 0 0 1 0
576 25 -1 48 92870 3 0.300 3 0 0 0 0 0 1
583 24 -1 38 95045 2 1.700 2 0 0 0 0 1 0
597 24 -2 125 92835 2 7.200 1 0 0 1 0 0 1
649 25 -1 82 92677 4 2.100 3 0 0 0 0 1 0
670 23 -1 61 92374 4 2.600 1 239 0 0 0 1 0
686 24 -1 38 92612 4 0.600 2 0 0 0 0 1 0
793 24 -2 150 94720 2 2.000 1 0 0 0 0 1 0
889 24 -2 82 91103 2 1.600 3 0 0 0 0 1 1
909 23 -1 149 91709 1 6.330 1 305 0 0 0 0 1
1173 24 -1 35 94305 2 1.700 2 0 0 0 0 0 0
1428 25 -1 21 94583 4 0.400 1 90 0 0 0 1 0
1522 25 -1 101 94720 4 2.300 3 256 0 0 0 0 1
1905 25 -1 112 92507 2 2.000 1 241 0 0 0 1 0
2102 25 -1 81 92647 2 1.600 3 0 0 0 0 1 1
2430 23 -1 73 92120 4 2.600 1 0 0 0 0 1 0
2466 24 -2 80 94105 2 1.600 3 0 0 0 0 1 0
2545 25 -1 39 94720 3 2.400 2 0 0 0 0 1 0
2618 23 -3 55 92704 3 2.400 2 145 0 0 0 1 0
2717 23 -2 45 95422 4 0.600 2 0 0 0 0 1 1
2848 24 -1 78 94720 2 1.800 2 0 0 0 0 0 0
2876 24 -2 80 91107 2 1.600 3 238 0 0 0 0 0
2962 23 -2 81 91711 2 1.800 2 0 0 0 0 0 0
2980 25 -1 53 94305 3 2.400 2 0 0 0 0 0 0
3076 29 -1 62 92672 2 1.750 3 0 0 0 0 0 1
3130 23 -2 82 92152 2 1.800 2 0 0 1 0 0 1
3157 23 -1 13 94720 4 1.000 1 84 0 0 0 1 0
3279 26 -1 44 94901 1 2.000 2 0 0 0 0 0 0
3284 25 -1 101 95819 4 2.100 3 0 0 0 0 0 1
3292 25 -1 13 95616 4 0.400 1 0 0 1 0 0 0
3394 25 -1 113 90089 4 2.100 3 0 0 0 0 1 0
3425 23 -1 12 91605 4 1.000 1 90 0 0 0 1 0
3626 24 -3 28 90089 4 1.000 3 0 0 0 0 0 0
3796 24 -2 50 94920 3 2.400 2 0 0 1 0 0 0
3824 23 -1 12 95064 4 1.000 1 0 0 1 0 0 1
3887 24 -2 118 92634 2 7.200 1 0 0 1 0 1 0
3946 25 -1 40 93117 3 2.400 2 0 0 0 0 1 0
4015 25 -1 139 93106 2 2.000 1 0 0 0 0 0 1
4088 29 -1 71 94801 2 1.750 3 0 0 0 0 0 0
4116 24 -2 135 90065 2 7.200 1 0 0 0 0 1 0
4285 23 -3 149 93555 2 7.200 1 0 0 0 0 1 0
4411 23 -2 75 90291 2 1.800 2 0 0 0 0 1 1
4481 25 -2 35 95045 4 1.000 3 0 0 0 0 1 0
4514 24 -3 41 91768 4 1.000 3 0 0 0 0 1 0
4582 25 -1 69 92691 3 0.300 3 0 0 0 0 1 0
4957 29 -1 50 95842 2 1.750 3 0 0 0 0 0 1
In [147]:
#common Experience values and counts
ploancp_df['Experience'].value_counts()
Out[147]:
 32    154
 20    148
 9     147
 5     146
 23    144
 35    143
 25    142
 28    138
 18    137
 19    135
 26    134
 24    131
 3     129
 16    127
 14    127
 30    126
 17    125
 34    125
 27    125
 22    124
 29    124
 7     121
 6     119
 15    119
 8     119
 10    118
 13    117
 33    117
 11    116
 37    116
 36    114
 21    113
 4     113
 31    104
 12    102
 38     88
 2      85
 39     85
 1      74
 0      66
 40     57
 41     43
-1      33
-2      15
 42      8
-3       4
 43      3
Name: Experience, dtype: int64
In [148]:
plt.figure(figsize=(6,4), dpi= 60)
plt.title("Positive Experience Distribution", fontsize=16)
sns.boxplot(data=ploancp_df.loc[ploancp_df['Experience']>=0],x='Experience')
plt.show()

Positive Experience is nicely distributed with a median (Q2) around 20 years, Q1 around 11, and Q3 around 30.

In [149]:
plt.figure(figsize=(6,4), dpi= 60)

plt.title("Negative Experience Distribution", fontsize=16)
sns.boxplot(data=ploancp_df.loc[ploancp_df['Experience']<0],x='Experience')
plt.show()
In [150]:
#experience = -3
ploancp_x3_df = ploancp_df.loc[ploancp_df['Experience']==-3]
ploancp_x3_df
Out[150]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
2618 23 -3 55 92704 3 2.400 2 145 0 0 0 1 0
3626 24 -3 28 90089 4 1.000 3 0 0 0 0 0 0
4285 23 -3 149 93555 2 7.200 1 0 0 0 0 1 0
4514 24 -3 41 91768 4 1.000 3 0 0 0 0 1 0

Notice that there are only 4 entries with a negative experience -3. Lets contrast them to Age and Education because these 2 other features are also related with time.

In [151]:
#experience = -2
ploancp_x2_df = ploancp_df.loc[ploancp_df['Experience']==-2]
ploancp_x2_df
Out[151]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
315 24 -2 51 90630 3 0.300 3 0 0 0 0 1 0
451 28 -2 48 94132 2 1.750 3 89 0 0 0 1 0
597 24 -2 125 92835 2 7.200 1 0 0 1 0 0 1
793 24 -2 150 94720 2 2.000 1 0 0 0 0 1 0
889 24 -2 82 91103 2 1.600 3 0 0 0 0 1 1
2466 24 -2 80 94105 2 1.600 3 0 0 0 0 1 0
2717 23 -2 45 95422 4 0.600 2 0 0 0 0 1 1
2876 24 -2 80 91107 2 1.600 3 238 0 0 0 0 0
2962 23 -2 81 91711 2 1.800 2 0 0 0 0 0 0
3130 23 -2 82 92152 2 1.800 2 0 0 1 0 0 1
3796 24 -2 50 94920 3 2.400 2 0 0 1 0 0 0
3887 24 -2 118 92634 2 7.200 1 0 0 1 0 1 0
4116 24 -2 135 90065 2 7.200 1 0 0 0 0 1 0
4411 23 -2 75 90291 2 1.800 2 0 0 0 0 1 1
4481 25 -2 35 95045 4 1.000 3 0 0 0 0 1 0

Notice that the Age range for experience=-2 is between 23 to 28.

In [152]:
#experience = -1
ploancp_x1_df = ploancp_df.loc[ploancp_df['Experience']==-1]
ploancp_x1_df
Out[152]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
89 25 -1 113 94303 4 2.300 3 0 0 0 0 0 1
226 24 -1 39 94085 2 1.700 2 0 0 0 0 0 0
524 24 -1 75 93014 4 0.200 1 0 0 0 0 1 0
536 25 -1 43 92173 3 2.400 2 176 0 0 0 1 0
540 25 -1 109 94010 4 2.300 3 314 0 0 0 1 0
576 25 -1 48 92870 3 0.300 3 0 0 0 0 0 1
583 24 -1 38 95045 2 1.700 2 0 0 0 0 1 0
649 25 -1 82 92677 4 2.100 3 0 0 0 0 1 0
670 23 -1 61 92374 4 2.600 1 239 0 0 0 1 0
686 24 -1 38 92612 4 0.600 2 0 0 0 0 1 0
909 23 -1 149 91709 1 6.330 1 305 0 0 0 0 1
1173 24 -1 35 94305 2 1.700 2 0 0 0 0 0 0
1428 25 -1 21 94583 4 0.400 1 90 0 0 0 1 0
1522 25 -1 101 94720 4 2.300 3 256 0 0 0 0 1
1905 25 -1 112 92507 2 2.000 1 241 0 0 0 1 0
2102 25 -1 81 92647 2 1.600 3 0 0 0 0 1 1
2430 23 -1 73 92120 4 2.600 1 0 0 0 0 1 0
2545 25 -1 39 94720 3 2.400 2 0 0 0 0 1 0
2848 24 -1 78 94720 2 1.800 2 0 0 0 0 0 0
2980 25 -1 53 94305 3 2.400 2 0 0 0 0 0 0
3076 29 -1 62 92672 2 1.750 3 0 0 0 0 0 1
3157 23 -1 13 94720 4 1.000 1 84 0 0 0 1 0
3279 26 -1 44 94901 1 2.000 2 0 0 0 0 0 0
3284 25 -1 101 95819 4 2.100 3 0 0 0 0 0 1
3292 25 -1 13 95616 4 0.400 1 0 0 1 0 0 0
3394 25 -1 113 90089 4 2.100 3 0 0 0 0 1 0
3425 23 -1 12 91605 4 1.000 1 90 0 0 0 1 0
3824 23 -1 12 95064 4 1.000 1 0 0 1 0 0 1
3946 25 -1 40 93117 3 2.400 2 0 0 0 0 1 0
4015 25 -1 139 93106 2 2.000 1 0 0 0 0 0 1
4088 29 -1 71 94801 2 1.750 3 0 0 0 0 0 0
4582 25 -1 69 92691 3 0.300 3 0 0 0 0 1 0
4957 29 -1 50 95842 2 1.750 3 0 0 0 0 0 1

Now for Experience=-1, the age range is between 23 and 29.

How about 0 Experience? Is that value registered?

In [153]:
#experience = 0
ploancp_x0_df = ploancp_df.loc[ploancp_df['Experience']==0]
x0_median=ploancp_x0_df['Age'].median()
x0_min=ploancp_x0_df['Age'].min()
x0_max=ploancp_x0_df['Age'].max()

ploancp_x0_df
Out[153]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
105 24 0 35 94704 3 0.100 2 0 0 1 0 1 0
151 26 0 132 92834 3 6.500 3 0 1 0 0 0 1
155 24 0 60 94596 4 1.600 1 0 0 0 0 1 0
156 26 0 15 92131 4 0.400 1 0 0 0 0 0 1
160 29 0 134 95819 4 6.500 3 0 1 0 0 0 0
182 24 0 135 95133 1 1.500 1 0 0 0 0 1 0
256 26 0 99 92697 4 2.300 3 0 0 0 0 0 1
347 25 0 43 94305 2 1.600 3 0 0 1 1 1 1
363 25 0 30 92691 2 1.700 2 0 0 0 0 0 0
379 25 0 28 92093 2 1.700 2 0 0 0 0 0 0
418 27 0 33 90089 4 1.000 3 0 0 0 0 0 0
466 25 0 13 91342 2 0.900 3 0 0 0 0 1 0
495 25 0 44 94545 4 0.600 2 0 0 0 0 1 1
873 24 0 88 90740 3 0.800 1 134 0 0 0 0 0
1057 30 0 63 95503 2 1.750 3 0 0 0 0 1 0
1181 25 0 65 90095 4 0.200 1 0 0 1 0 0 0
1337 26 0 179 92028 4 2.100 2 0 1 0 0 0 0
1732 25 0 88 94566 2 1.800 2 319 0 0 0 1 1
1765 26 0 149 95051 2 7.200 1 154 0 0 0 0 0
1847 25 0 52 95126 3 2.600 3 159 0 0 0 0 0
2009 25 0 99 92735 1 1.900 1 323 0 0 0 0 0
2157 25 0 71 93727 4 0.200 1 78 0 1 0 0 0
2165 27 0 38 95929 4 1.000 3 154 0 0 0 1 0
2241 26 0 14 94301 4 0.400 1 94 0 0 0 1 0
2259 24 0 82 90401 3 0.800 1 0 0 0 0 1 0
2417 25 0 53 90095 2 1.600 3 0 0 0 0 1 1
2648 26 0 155 93105 2 7.200 1 0 0 0 0 0 0
2652 24 0 44 90089 4 1.600 1 180 0 0 0 1 0
2756 27 0 40 91301 4 1.000 3 0 0 0 0 1 0
3075 26 0 85 95616 2 1.600 3 0 0 0 0 0 0
3084 26 0 129 90028 3 0.700 2 0 1 0 0 0 0
3135 25 0 91 95039 2 1.800 2 321 0 0 0 0 0
3147 26 0 30 94024 4 1.300 3 0 0 0 0 1 1
3378 25 0 44 94536 4 0.600 2 0 0 0 0 0 1
3538 26 0 23 93561 1 0.100 2 0 0 0 0 0 0
3747 26 0 83 91360 3 3.900 2 0 1 0 0 1 0
3765 26 0 54 94706 3 0.300 3 0 0 0 0 1 0
3818 26 0 102 94305 4 2.300 3 0 0 0 0 0 0
3870 25 0 25 94596 2 0.900 3 0 0 0 0 0 0
3889 26 0 19 93014 1 0.100 2 121 0 0 0 1 0
3908 24 0 44 90638 3 0.100 2 0 0 0 0 0 0
3982 24 0 119 94566 1 1.500 1 0 0 0 0 1 0
4017 26 0 42 92009 4 1.300 3 153 0 0 0 1 0
4046 25 0 72 94303 3 2.600 3 0 0 0 0 1 0
4077 26 0 71 92093 4 1.800 2 0 0 1 0 1 0
4080 27 0 40 90068 1 2.000 2 110 0 0 0 0 1
4109 27 0 30 93107 4 1.000 3 0 0 0 0 1 1
4282 26 0 195 92093 3 6.330 3 0 1 1 1 1 0
4321 27 0 34 92717 1 2.000 2 112 0 0 0 0 1
4393 24 0 59 95521 4 1.600 1 0 0 0 0 0 0
4425 26 0 164 95973 2 4.000 3 301 1 0 0 1 0
4529 27 0 40 92103 4 1.000 3 0 0 0 0 0 0
4551 27 0 28 91330 4 1.500 2 0 0 0 0 1 0
4566 24 0 131 92831 1 5.400 1 0 0 0 0 1 0
4568 26 0 44 94305 4 1.300 3 0 0 1 0 0 0
4584 26 0 49 90089 3 2.400 2 0 0 0 0 0 0
4677 25 0 38 93407 2 1.600 3 0 0 0 0 0 0
4679 26 0 161 94551 2 7.200 1 0 0 0 0 0 0
4712 25 0 14 94309 2 0.900 3 0 0 0 0 0 1
4782 26 0 150 91311 2 7.200 1 0 0 0 0 0 1
4796 26 0 42 95032 4 1.300 3 0 0 1 0 0 0
4874 26 0 75 94061 3 0.300 3 0 0 0 0 0 0
4901 26 0 54 96094 3 1.100 2 0 0 0 0 1 0
4934 26 0 85 93950 2 1.600 3 0 0 0 0 1 1
4943 26 0 12 96003 1 0.100 2 0 0 0 0 1 0
4989 24 0 38 93555 1 1.000 3 0 0 0 0 1 0
In [154]:
print("For Experience 0, The minimum age is",x0_min, ",the mean age is",x0_median,",the max age is",x0_max)
For Experience 0, The minimum age is 24 ,the mean age is 26.0 ,the max age is 30

We can see that there are many customers with Experience 0.

In [155]:
#also for our plotting lets have a dataframe that includes all negative Experience.
ploancp_negEx_df = ploancp_df.loc[ploancp_df['Experience']<0]

We have to take care of the negative values.

Look at Experience -3, -2, -1. As we know, negative experience is not a valid value. How can we treat these negative values?

  • Change them to the mode or median?
  • Replace them with a 0?
  • Change them to positive?
  • Drop those rows?

We will do this during our bivariate analysis because we need to analyze Experience with other Time dependant features like Age and Education.

Analysis Feature: Income¶

Lets take a look at income in more detail.

In [156]:
plt.figure(figsize=(6,4), dpi= 60)
plt.title("Income Distribution", fontsize=16)
sns.boxplot(x='Income', data=ploancp_df, notch=False)
plt.show()

The median is 73,774 (obtaine from Description) and shown in the boxplot above. Notice that some customers make over 180K but these outliers are valid data.

In [157]:
plt.figure(figsize=(6,3), dpi= 80)
sns.kdeplot(ploancp_df['Income'], shade=True, color="g", alpha=.7)
plt.title("Income Distribution")
plt.show()

Lets bin the income to make a bar plot. From Description we know what the min, max, median values are. But lets get them again.

In [158]:
income_min = ploancp_df['Income'].min()
income_max = ploancp_df['Income'].max()
income_mean = ploancp_df['Income'].median()
print("The Income minimum is", income_min, "K, the Income mean is", income_mean,"K, and the Income max is",income_max,"K")
The Income minimum is 8 K, the Income mean is 64.0 K, and the Income max is 224 K
In [159]:
bins = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
labels = ['[0..25)','[25..50)','[50..75)', '[75..100)','[100..125)','[125..150)','[150..175)','[175..200)','[200..225]']
ploancp_df['Income_bin'] = pd.cut(ploancp_df['Income'], bins=bins, labels=labels)
print (ploancp_df)
      Age  Experience  Income ZIPCode  Family  CCAvg Education  Mortgage  \
0      25           1      49   91107       4  1.600         1         0   
1      45          19      34   90089       3  1.500         1         0   
2      39          15      11   94720       1  1.000         1         0   
3      35           9     100   94112       1  2.700         2         0   
4      35           8      45   91330       4  1.000         2         0   
...   ...         ...     ...     ...     ...    ...       ...       ...   
4995   29           3      40   92697       1  1.900         3         0   
4996   30           4      15   92037       4  0.400         1        85   
4997   63          39      24   93023       2  0.300         3         0   
4998   65          40      49   90034       3  0.500         2         0   
4999   28           4      83   92612       3  0.800         1         0   

     Personal_Loan Securities_Account CD_Account Online CreditCard Income_bin  
0                0                  1          0      0          0   [25..50)  
1                0                  1          0      0          0   [25..50)  
2                0                  0          0      0          0    [0..25)  
3                0                  0          0      0          0  [75..100)  
4                0                  0          0      0          1   [25..50)  
...            ...                ...        ...    ...        ...        ...  
4995             0                  0          0      1          0   [25..50)  
4996             0                  0          0      1          0    [0..25)  
4997             0                  0          0      0          0    [0..25)  
4998             0                  0          0      1          0   [25..50)  
4999             0                  0          0      1          1  [75..100)  

[5000 rows x 14 columns]
In [160]:
labeled_barplot(ploancp_df, 'Income_bin', perc=True, n=None)

Notice that 25% of customers are in the [25..50) bin, and 20% are in the [50..75) bin.

Analysis Feature: ZIPCode¶

Lets analyze ZIP Codes in more detail.

In [161]:
#Show unique ZIPCodes.
ploancp_df['ZIPCode'].unique()
Out[161]:
array([91107, 90089, 94720, 94112, 91330, 92121, 91711, 93943, 93023,
       94710, 90277, 93106, 94920, 91741, 95054, 95010, 94305, 91604,
       94015, 90095, 91320, 95521, 95064, 90064, 94539, 94104, 94117,
       94801, 94035, 92647, 95814, 94114, 94115, 92672, 94122, 90019,
       95616, 94065, 95014, 91380, 95747, 92373, 92093, 94005, 90245,
       95819, 94022, 90404, 93407, 94523, 90024, 91360, 95670, 95123,
       90045, 91335, 93907, 92007, 94606, 94611, 94901, 92220, 93305,
       95134, 94612, 92507, 91730, 94501, 94303, 94105, 94550, 92612,
       95617, 92374, 94080, 94608, 93555, 93311, 94704, 92717, 92037,
       95136, 94542, 94143, 91775, 92703, 92354, 92024, 92831, 92833,
       94304, 90057, 92130, 91301, 92096, 92646, 92182, 92131, 93720,
       90840, 95035, 93010, 94928, 95831, 91770, 90007, 94102, 91423,
       93955, 94107, 92834, 93117, 94551, 94596, 94025, 94545, 95053,
       90036, 91125, 95120, 94706, 95827, 90503, 90250, 95817, 95503,
       93111, 94132, 95818, 91942, 90401, 93524, 95133, 92173, 94043,
       92521, 92122, 93118, 92697, 94577, 91345, 94123, 92152, 91355,
       94609, 94306, 96150, 94110, 94707, 91326, 90291, 92807, 95051,
       94085, 92677, 92614, 92626, 94583, 92103, 92691, 92407, 90504,
       94002, 95039, 94063, 94923, 95023, 90058, 92126, 94118, 90029,
       92806, 94806, 92110, 94536, 90623, 92069, 92843, 92120, 95605,
       90740, 91207, 95929, 93437, 90630, 90034, 90266, 95630, 93657,
       92038, 91304, 92606, 92192, 90745, 95060, 94301, 92692, 92101,
       94610, 90254, 94590, 92028, 92054, 92029, 93105, 91941, 92346,
       94402, 94618, 94904, 93077, 95482, 91709, 91311, 94509, 92866,
       91745, 94111, 94309, 90073, 92333, 90505, 94998, 94086, 94709,
       95825, 90509, 93108, 94588, 91706, 92109, 92068, 95841, 92123,
       91342, 90232, 92634, 91006, 91768, 90028, 92008, 95112, 92154,
       92115, 92177, 90640, 94607, 92780, 90009, 92518, 91007, 93014,
       94024, 90027, 95207, 90717, 94534, 94010, 91614, 94234, 90210,
       95020, 92870, 92124, 90049, 94521, 95678, 95045, 92653, 92821,
       90025, 92835, 91910, 94701, 91129, 90071, 96651, 94960, 91902,
       90033, 95621, 90037, 90005, 93940, 91109, 93009, 93561, 95126,
       94109, 93107, 94591, 92251, 92648, 92709, 91754, 92009, 96064,
       91103, 91030, 90066, 95403, 91016, 95348, 91950, 95822, 94538,
       92056, 93063, 91040, 92661, 94061, 95758, 96091, 94066, 94939,
       95138, 95762, 92064, 94708, 92106, 92116, 91302, 90048, 90405,
       92325, 91116, 92868, 90638, 90747, 93611, 95833, 91605, 92675,
       90650, 95820, 90018, 93711, 95973, 92886, 95812, 91203, 91105,
       95008, 90016, 90035, 92129, 90720, 94949, 90041, 95003, 95192,
       91101, 94126, 90230, 93101, 91365, 91367, 91763, 92660, 92104,
       91361, 90011, 90032, 95354, 94546, 92673, 95741, 95351, 92399,
       90274, 94087, 90044, 94131, 94124, 95032, 90212, 93109, 94019,
       95828, 90086, 94555, 93033, 93022, 91343, 91911, 94803, 94553,
       95211, 90304, 92084, 90601, 92704, 92350, 94705, 93401, 90502,
       94571, 95070, 92735, 95037, 95135, 94028, 96003, 91024, 90065,
       95405, 95370, 93727, 92867, 95821, 94566, 95125, 94526, 94604,
       96008, 93065, 96001, 95006, 90639, 92630, 95307, 91801, 94302,
       91710, 93950, 90059, 94108, 94558, 93933, 92161, 94507, 94575,
       95449, 93403, 93460, 95005, 93302, 94040, 91401, 95816, 92624,
       95131, 94965, 91784, 91765, 90280, 95422, 95518, 95193, 92694,
       90275, 90272, 91791, 92705, 91773, 93003, 90755, 96145, 94703,
       96094, 95842, 94116, 90068, 94970, 90813, 94404, 94598],
      dtype=object)
In [162]:
numUniqueZips=ploancp_df['ZIPCode'].nunique()
print("The number of ZIPCodes is",numUniqueZips)
The number of ZIPCodes is 467
In [163]:
labeled_barplot(ploancp_df, 'ZIPCode', perc=True, n=None)

This is also very difficult to visualize and alone it doesn't really give us much information. We will revise it during our bivariate analysis.

Analysis Feature: Family¶

We visualized Family earlier as a pie chart w/ percentage.

In [164]:
plt.figure(figsize=(6,4), dpi= 60)

plt.title("Family Size Distribution", fontsize=16)
sns.boxplot(data=ploancp_df,x='Family')
plt.show()

We see that the median is at 2 members per family, 1 is Q1, 3 is Q3.

Analysis Feature: Education¶

The description states that 1: undergraduate, 2: graduate, 3: advanced/professional degree.

In [165]:
education_count = ploancp_df['Education'].value_counts()
education_count
Out[165]:
1    2096
3    1501
2    1403
Name: Education, dtype: int64
In [166]:
#lets revisit the pie chart for Education, now with the proper labels
ed_df = ploancp_df['Education'].value_counts()
edlabels = {'UnderGraduate','Graduate','Advanced/Professional'}
plt.pie(ed_df,labels = edlabels, autopct=autopct_format(ed_df))
plt.title("Education Level Distribution")

plt.show()

Is the Education definition correct. Are all customers in the dataset college educated? Was the dataset made to only include college educated customers?

But in any case, lets stick to the categorization made by the bank which states that all customers in this dataset are college educated.

Analysis Feature: Credit Card Use¶

In [167]:
plt.figure(figsize=(6,4), dpi= 60)
plt.title("Credit Card Average Usage Distribution", fontsize=16)
sns.boxplot(data=ploancp_df,x='CCAvg')
plt.show()

There are many outliers. However, these may be possible. During our bivariate analysis, we will take a look at Credit Card average usage again to obtain more meaningful information.

Analysis Feature: Mortgage

In [168]:
#Mortage Distribution
plt.figure(figsize=(6,4), dpi= 60)
plt.title("Mortgage Distribution", fontsize=16)
sns.boxplot(data=ploancp_df,x='Mortgage')
plt.show()

Wouldn't Credit Card usage as a Percentage of Income be more useful? We will take a look at it later during the bivariate analysis.

Analysis Feature: Online, CD Account, Credit Card, Security Account¶

We have visualized these as countplots and as pie charts. What other insights can we obtain from these from a univariate analysis? Probably not much than we already saw earlier. So let's move on with the bivariate analysis.

Bivariate Analysis¶

Age vs Personal Loan¶

In [169]:
distribution_plot_wrt_target(ploancp_df, "Age", "Personal_Loan")

Without Loans and w/o outliers: Notice that customers' age 30, 38, 46,52, and 58 are the most common. With Loans and w/o outliers: Notice that 35 and 42 are the most common. What could we determine? Not much.

In [170]:
stacked_barplot(ploancp_df, "Age", "Personal_Loan")
Personal_Loan     0    1   All
Age                           
All            4520  480  5000
34              116   18   134
30              119   17   136
36               91   16   107
63               92   16   108
35              135   16   151
33              105   15   120
52              130   15   145
29              108   15   123
54              128   15   143
43              134   15   149
42              112   14   126
56              121   14   135
65               66   14    80
44              107   14   121
50              125   13   138
45              114   13   127
46              114   13   127
26               65   13    78
32              108   12   120
57              120   12   132
38              103   12   115
27               79   12    91
48              106   12   118
61              110   12   122
53              101   11   112
51              119   10   129
60              117   10   127
58              133   10   143
49              105   10   115
47              103   10   113
59              123    9   132
28               94    9   103
62              114    9   123
55              116    9   125
64               70    8    78
41              128    8   136
40              117    8   125
37               98    8   106
31              118    7   125
39              127    6   133
24               28    0    28
25               53    0    53
66               24    0    24
67               12    0    12
23               12    0    12
------------------------------------------------------------------------------------------------------------------------

From the stacked plot we can see that customers aged 65, 26, 36, 63, 34, 27, 30, 33, 29 are the ones that have highest percentage of personal loans within their own age group.

Now notice that age 65 has the highest % of loans within its total, followed by 26, 36, 63, 34, 27, and 30. So 63+ group may have a larger chance to be loan customers. and also the group between 26-27 and 33-36.

Experience vs Age for Negative Experience¶

During the univariate analysis we noticed that there are entries with Negative Experience, which of course are wrong values.

We could replace Negative Experience with Median and Mode values, with zeros, or making the values positive and then analyze against other time dependant features like Age and Education.

In [171]:
#case to mode/median
Experience_mode = ploancp_df['Experience'].mode()
print("Experience mode is", Experience_mode[0])
Experience_median = ploancp_df['Experience'].median()
print("Experience median is",Experience_median)
Experience mode is 32
Experience median is 20.0
In [172]:
ploancp_negEx_df['modeX']= ploancp_negEx_df['Experience']-Experience_mode[0]
ploancp_negEx_df['medianX']= ploancp_negEx_df['Experience']-Experience_median
In [173]:
sns.displot(ploancp_negEx_df, x="modeX", kind="kde")
plt.xlabel("Age")
plt.title("Age minus Experience Mode")
plt.show()
In [174]:
sns.displot(ploancp_negEx_df, x="medianX", kind="kde")
plt.xlabel("Age")
plt.title("Age minus Experience Median")
plt.show()

So it wouldn't make any sense to change Experience=-3, -2, -1 to the mode or median.

  • Changing Negative experience to mode will produce Age < Experience!
  • Changing Negative experience to median would produce Ages between 3 to 9 years of age!

Therefore, lets rule the Mode/Median Option.

Therefore, lets rule out these 2 options.

How about replacing Experience by 0?

In [175]:
ploancp_x0_df = ploancp_df.loc[ploancp_df['Experience']<0]
ploancp_x0_df['Experience'] = 0
ploancp_x0_df
Out[175]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Income_bin
89 25 0 113 94303 4 2.300 3 0 0 0 0 0 1 [100..125)
226 24 0 39 94085 2 1.700 2 0 0 0 0 0 0 [25..50)
315 24 0 51 90630 3 0.300 3 0 0 0 0 1 0 [50..75)
451 28 0 48 94132 2 1.750 3 89 0 0 0 1 0 [25..50)
524 24 0 75 93014 4 0.200 1 0 0 0 0 1 0 [50..75)
536 25 0 43 92173 3 2.400 2 176 0 0 0 1 0 [25..50)
540 25 0 109 94010 4 2.300 3 314 0 0 0 1 0 [100..125)
576 25 0 48 92870 3 0.300 3 0 0 0 0 0 1 [25..50)
583 24 0 38 95045 2 1.700 2 0 0 0 0 1 0 [25..50)
597 24 0 125 92835 2 7.200 1 0 0 1 0 0 1 [100..125)
649 25 0 82 92677 4 2.100 3 0 0 0 0 1 0 [75..100)
670 23 0 61 92374 4 2.600 1 239 0 0 0 1 0 [50..75)
686 24 0 38 92612 4 0.600 2 0 0 0 0 1 0 [25..50)
793 24 0 150 94720 2 2.000 1 0 0 0 0 1 0 [125..150)
889 24 0 82 91103 2 1.600 3 0 0 0 0 1 1 [75..100)
909 23 0 149 91709 1 6.330 1 305 0 0 0 0 1 [125..150)
1173 24 0 35 94305 2 1.700 2 0 0 0 0 0 0 [25..50)
1428 25 0 21 94583 4 0.400 1 90 0 0 0 1 0 [0..25)
1522 25 0 101 94720 4 2.300 3 256 0 0 0 0 1 [100..125)
1905 25 0 112 92507 2 2.000 1 241 0 0 0 1 0 [100..125)
2102 25 0 81 92647 2 1.600 3 0 0 0 0 1 1 [75..100)
2430 23 0 73 92120 4 2.600 1 0 0 0 0 1 0 [50..75)
2466 24 0 80 94105 2 1.600 3 0 0 0 0 1 0 [75..100)
2545 25 0 39 94720 3 2.400 2 0 0 0 0 1 0 [25..50)
2618 23 0 55 92704 3 2.400 2 145 0 0 0 1 0 [50..75)
2717 23 0 45 95422 4 0.600 2 0 0 0 0 1 1 [25..50)
2848 24 0 78 94720 2 1.800 2 0 0 0 0 0 0 [75..100)
2876 24 0 80 91107 2 1.600 3 238 0 0 0 0 0 [75..100)
2962 23 0 81 91711 2 1.800 2 0 0 0 0 0 0 [75..100)
2980 25 0 53 94305 3 2.400 2 0 0 0 0 0 0 [50..75)
3076 29 0 62 92672 2 1.750 3 0 0 0 0 0 1 [50..75)
3130 23 0 82 92152 2 1.800 2 0 0 1 0 0 1 [75..100)
3157 23 0 13 94720 4 1.000 1 84 0 0 0 1 0 [0..25)
3279 26 0 44 94901 1 2.000 2 0 0 0 0 0 0 [25..50)
3284 25 0 101 95819 4 2.100 3 0 0 0 0 0 1 [100..125)
3292 25 0 13 95616 4 0.400 1 0 0 1 0 0 0 [0..25)
3394 25 0 113 90089 4 2.100 3 0 0 0 0 1 0 [100..125)
3425 23 0 12 91605 4 1.000 1 90 0 0 0 1 0 [0..25)
3626 24 0 28 90089 4 1.000 3 0 0 0 0 0 0 [25..50)
3796 24 0 50 94920 3 2.400 2 0 0 1 0 0 0 [25..50)
3824 23 0 12 95064 4 1.000 1 0 0 1 0 0 1 [0..25)
3887 24 0 118 92634 2 7.200 1 0 0 1 0 1 0 [100..125)
3946 25 0 40 93117 3 2.400 2 0 0 0 0 1 0 [25..50)
4015 25 0 139 93106 2 2.000 1 0 0 0 0 0 1 [125..150)
4088 29 0 71 94801 2 1.750 3 0 0 0 0 0 0 [50..75)
4116 24 0 135 90065 2 7.200 1 0 0 0 0 1 0 [125..150)
4285 23 0 149 93555 2 7.200 1 0 0 0 0 1 0 [125..150)
4411 23 0 75 90291 2 1.800 2 0 0 0 0 1 1 [50..75)
4481 25 0 35 95045 4 1.000 3 0 0 0 0 1 0 [25..50)
4514 24 0 41 91768 4 1.000 3 0 0 0 0 1 0 [25..50)
4582 25 0 69 92691 3 0.300 3 0 0 0 0 1 0 [50..75)
4957 29 0 50 95842 2 1.750 3 0 0 0 0 0 1 [25..50)
In [176]:
#lets visualize age vs Experience when negative Experience is replaced by 0
sns.displot(ploancp_x0_df, x="Age", hue="Experience", kind="kde")
plt.title("Age vs Experience 0")
plt.show()

Notice : 29 years there -> 0 Experience. Also, between 21 to 27 -> 0 Experience, being the largest around 24-25 -> 0 Experience.

In [177]:
ploancp_x0_age_min_df = ploancp_x0_df['Age'].min()
ploancp_x0_age_median_df = ploancp_x0_df['Age'].median()
ploancp_x0_age_max_df = ploancp_x0_df['Age'].max()
print("For negative experience converted to 0, the minimum age is",ploancp_x0_age_min_df,
      ",the mean age is",ploancp_x0_age_median_df,
      ",the maximum age is",ploancp_x0_age_max_df)
For negative experience converted to 0, the minimum age is 23 ,the mean age is 24.0 ,the maximum age is 29

For Experience = 0

  • min: 24, mean: 26, max: 30

For Negative Experience converted to 0

  • min: 23, mean: 24m max: 29

Differences between Experience 0 and Converted to 0:

  • min: 1, mean: 2, max: 1

How about converting negative experience to positive values, so for example a negative experience of -3 becomes 3?

In [178]:
#make all experience year positive
ploancp_negEx_df['allPosX']= ploancp_negEx_df['Experience'].abs()
ploancp_negEx_df
Out[178]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard modeX medianX allPosX
89 25 -1 113 94303 4 2.300 3 0 0 0 0 0 1 -33 -21.000 1
226 24 -1 39 94085 2 1.700 2 0 0 0 0 0 0 -33 -21.000 1
315 24 -2 51 90630 3 0.300 3 0 0 0 0 1 0 -34 -22.000 2
451 28 -2 48 94132 2 1.750 3 89 0 0 0 1 0 -34 -22.000 2
524 24 -1 75 93014 4 0.200 1 0 0 0 0 1 0 -33 -21.000 1
536 25 -1 43 92173 3 2.400 2 176 0 0 0 1 0 -33 -21.000 1
540 25 -1 109 94010 4 2.300 3 314 0 0 0 1 0 -33 -21.000 1
576 25 -1 48 92870 3 0.300 3 0 0 0 0 0 1 -33 -21.000 1
583 24 -1 38 95045 2 1.700 2 0 0 0 0 1 0 -33 -21.000 1
597 24 -2 125 92835 2 7.200 1 0 0 1 0 0 1 -34 -22.000 2
649 25 -1 82 92677 4 2.100 3 0 0 0 0 1 0 -33 -21.000 1
670 23 -1 61 92374 4 2.600 1 239 0 0 0 1 0 -33 -21.000 1
686 24 -1 38 92612 4 0.600 2 0 0 0 0 1 0 -33 -21.000 1
793 24 -2 150 94720 2 2.000 1 0 0 0 0 1 0 -34 -22.000 2
889 24 -2 82 91103 2 1.600 3 0 0 0 0 1 1 -34 -22.000 2
909 23 -1 149 91709 1 6.330 1 305 0 0 0 0 1 -33 -21.000 1
1173 24 -1 35 94305 2 1.700 2 0 0 0 0 0 0 -33 -21.000 1
1428 25 -1 21 94583 4 0.400 1 90 0 0 0 1 0 -33 -21.000 1
1522 25 -1 101 94720 4 2.300 3 256 0 0 0 0 1 -33 -21.000 1
1905 25 -1 112 92507 2 2.000 1 241 0 0 0 1 0 -33 -21.000 1
2102 25 -1 81 92647 2 1.600 3 0 0 0 0 1 1 -33 -21.000 1
2430 23 -1 73 92120 4 2.600 1 0 0 0 0 1 0 -33 -21.000 1
2466 24 -2 80 94105 2 1.600 3 0 0 0 0 1 0 -34 -22.000 2
2545 25 -1 39 94720 3 2.400 2 0 0 0 0 1 0 -33 -21.000 1
2618 23 -3 55 92704 3 2.400 2 145 0 0 0 1 0 -35 -23.000 3
2717 23 -2 45 95422 4 0.600 2 0 0 0 0 1 1 -34 -22.000 2
2848 24 -1 78 94720 2 1.800 2 0 0 0 0 0 0 -33 -21.000 1
2876 24 -2 80 91107 2 1.600 3 238 0 0 0 0 0 -34 -22.000 2
2962 23 -2 81 91711 2 1.800 2 0 0 0 0 0 0 -34 -22.000 2
2980 25 -1 53 94305 3 2.400 2 0 0 0 0 0 0 -33 -21.000 1
3076 29 -1 62 92672 2 1.750 3 0 0 0 0 0 1 -33 -21.000 1
3130 23 -2 82 92152 2 1.800 2 0 0 1 0 0 1 -34 -22.000 2
3157 23 -1 13 94720 4 1.000 1 84 0 0 0 1 0 -33 -21.000 1
3279 26 -1 44 94901 1 2.000 2 0 0 0 0 0 0 -33 -21.000 1
3284 25 -1 101 95819 4 2.100 3 0 0 0 0 0 1 -33 -21.000 1
3292 25 -1 13 95616 4 0.400 1 0 0 1 0 0 0 -33 -21.000 1
3394 25 -1 113 90089 4 2.100 3 0 0 0 0 1 0 -33 -21.000 1
3425 23 -1 12 91605 4 1.000 1 90 0 0 0 1 0 -33 -21.000 1
3626 24 -3 28 90089 4 1.000 3 0 0 0 0 0 0 -35 -23.000 3
3796 24 -2 50 94920 3 2.400 2 0 0 1 0 0 0 -34 -22.000 2
3824 23 -1 12 95064 4 1.000 1 0 0 1 0 0 1 -33 -21.000 1
3887 24 -2 118 92634 2 7.200 1 0 0 1 0 1 0 -34 -22.000 2
3946 25 -1 40 93117 3 2.400 2 0 0 0 0 1 0 -33 -21.000 1
4015 25 -1 139 93106 2 2.000 1 0 0 0 0 0 1 -33 -21.000 1
4088 29 -1 71 94801 2 1.750 3 0 0 0 0 0 0 -33 -21.000 1
4116 24 -2 135 90065 2 7.200 1 0 0 0 0 1 0 -34 -22.000 2
4285 23 -3 149 93555 2 7.200 1 0 0 0 0 1 0 -35 -23.000 3
4411 23 -2 75 90291 2 1.800 2 0 0 0 0 1 1 -34 -22.000 2
4481 25 -2 35 95045 4 1.000 3 0 0 0 0 1 0 -34 -22.000 2
4514 24 -3 41 91768 4 1.000 3 0 0 0 0 1 0 -35 -23.000 3
4582 25 -1 69 92691 3 0.300 3 0 0 0 0 1 0 -33 -21.000 1
4957 29 -1 50 95842 2 1.750 3 0 0 0 0 0 1 -33 -21.000 1

Lets see our results when we convert negative experience to its positive value.

In [179]:
#lets visualize age vs Experience when negative Experience is replaced by its positive value
sns.displot(ploancp_negEx_df, x="Age", hue="allPosX", kind="kde")
plt.title("Age vs Experience Positive for negatives")
plt.show()

If we converted the negative experience to Positive values, we notice 29 years -> 1 Experience, and a mostly around 25 years -> 1 Experience. We also notice that around 24 years -> 2 Experience, and at 28 years -> 2 Experience. Notice that around 23-24 years -> 3 Experience.

Does this make sense?

In [180]:
#experience = 1
ploancp_x1_df = ploancp_df.loc[ploancp_df['Experience']==1]
x1_median=ploancp_x1_df['Age'].median()
x1_min=ploancp_x1_df['Age'].min()
x1_max=ploancp_x1_df['Age'].max()

ploancp_x1_df
Out[180]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Income_bin
0 25 1 49 91107 4 1.600 1 0 0 1 0 0 0 [25..50)
132 31 1 51 90840 2 1.750 3 0 0 0 0 0 0 [50..75)
143 25 1 54 94117 4 1.600 1 0 0 0 0 1 1 [50..75)
165 27 1 43 94706 1 1.500 1 0 0 0 0 1 0 [25..50)
166 25 1 21 95827 3 1.000 2 0 0 0 0 0 0 [0..25)
169 27 1 112 90503 4 2.100 3 0 0 0 0 0 1 [100..125)
170 27 1 138 90250 2 2.000 1 0 0 0 0 1 0 [125..150)
234 26 1 80 95616 1 0.800 2 150 0 0 0 0 0 [75..100)
249 26 1 55 90089 3 2.600 3 113 0 0 0 0 1 [50..75)
263 27 1 74 92121 4 1.800 2 112 0 0 0 1 1 [50..75)
484 25 1 113 95023 2 0.200 1 0 0 0 0 1 1 [100..125)
514 27 1 74 91730 3 0.300 3 0 0 0 0 1 0 [50..75)
732 26 1 85 90064 1 1.900 1 0 0 0 0 1 0 [75..100)
920 27 1 42 94501 3 2.400 2 0 0 0 0 0 0 [25..50)
964 27 1 78 92037 4 2.300 3 157 0 1 0 1 0 [75..100)
1003 25 1 62 94720 4 0.000 1 229 0 0 0 1 0 [50..75)
1065 25 1 113 90401 3 2.500 1 0 0 0 0 0 1 [100..125)
1092 25 1 70 92120 4 2.600 1 0 0 1 0 1 0 [50..75)
1160 28 1 40 95134 1 2.000 2 0 0 1 0 1 0 [25..50)
1204 26 1 190 91604 4 1.300 2 197 1 0 0 1 0 [175..200)
1230 27 1 25 94920 4 0.300 2 0 0 0 0 1 1 [0..25)
1255 27 1 80 95354 2 1.600 3 185 0 0 0 1 1 [75..100)
1262 26 1 53 94720 2 1.600 3 0 0 0 0 1 0 [50..75)
1556 31 1 60 94143 4 4.000 3 244 0 0 0 0 0 [50..75)
1653 26 1 24 96651 2 0.900 3 123 0 0 0 0 1 [0..25)
1690 26 1 102 95521 1 1.900 1 0 0 0 0 0 0 [100..125)
1868 25 1 118 92833 1 5.400 1 0 0 0 0 1 1 [100..125)
1984 26 1 55 92630 4 1.700 2 175 0 0 0 1 0 [50..75)
2192 25 1 13 95814 4 1.000 1 95 0 0 0 0 1 [0..25)
2226 25 1 98 90717 1 5.400 1 0 0 0 0 1 0 [75..100)
2273 27 1 83 91775 4 2.100 3 0 0 0 0 1 1 [75..100)
2360 27 1 85 93302 2 1.600 3 0 0 0 0 0 0 [75..100)
2367 26 1 80 95616 4 0.200 1 0 0 0 0 1 1 [75..100)
2389 27 1 41 90033 1 1.900 3 0 0 0 0 0 1 [25..50)
2446 25 1 70 93010 4 2.600 1 218 0 0 0 1 0 [50..75)
2452 25 1 28 94596 1 1.000 3 0 0 0 0 1 0 [25..50)
2526 26 1 50 95616 4 0.600 2 0 0 0 0 0 0 [25..50)
2527 27 1 43 95120 3 1.100 2 0 0 0 0 0 0 [25..50)
2675 31 1 70 92115 2 1.750 3 0 0 0 1 1 1 [50..75)
2754 26 1 61 93943 4 2.200 1 119 0 0 0 0 0 [50..75)
2815 26 1 48 94019 3 2.600 3 169 0 0 0 0 0 [25..50)
2836 25 1 74 94085 4 2.600 1 204 0 0 0 0 0 [50..75)
2898 27 1 140 91711 1 5.900 2 175 1 1 1 1 0 [125..150)
3010 25 1 72 94301 3 0.800 1 0 0 0 0 1 0 [50..75)
3146 26 1 38 91910 4 1.700 2 0 0 0 0 1 0 [25..50)
3154 27 1 99 94305 1 3.000 3 149 1 0 0 0 1 [75..100)
3339 27 1 141 95135 4 5.100 3 354 1 0 0 0 0 [125..150)
3440 26 1 39 95133 4 0.600 2 0 0 0 0 0 1 [25..50)
3459 26 1 88 94025 2 1.800 2 0 0 0 0 0 0 [75..100)
3486 25 1 20 92806 4 1.000 1 0 0 0 0 0 1 [0..25)
3506 27 1 58 95827 4 1.800 2 154 0 0 1 1 1 [50..75)
3627 27 1 83 90034 2 0.200 1 0 0 0 0 0 1 [75..100)
3708 31 1 74 92116 4 4.000 3 0 0 0 0 0 0 [50..75)
3711 27 1 20 94720 4 0.400 1 99 0 0 0 1 0 [0..25)
3732 26 1 18 92521 2 0.900 3 95 0 0 0 0 0 [0..25)
3845 26 1 54 94061 4 0.600 2 230 0 0 0 0 0 [50..75)
3884 27 1 112 91330 4 2.300 3 402 0 0 0 1 1 [100..125)
4026 27 1 142 92038 3 5.500 1 0 1 0 0 0 0 [125..150)
4235 27 1 91 92173 2 0.200 1 0 0 0 0 1 0 [75..100)
4271 25 1 150 92507 1 6.330 1 0 0 0 0 0 0 [125..150)
4281 28 1 34 94949 4 1.500 2 162 0 0 0 0 1 [25..50)
4305 26 1 54 91709 2 1.600 3 0 0 1 0 0 0 [50..75)
4345 26 1 184 94608 2 4.200 3 577 1 0 1 1 1 [175..200)
4504 27 1 41 93023 4 1.800 3 147 0 0 0 0 0 [25..50)
4507 26 1 8 94550 2 0.900 3 0 0 0 0 0 1 [0..25)
4627 27 1 134 93106 1 1.700 2 307 1 0 0 1 0 [125..150)
4628 27 1 130 94801 3 2.900 2 0 1 0 0 0 0 [125..150)
4669 27 1 64 94501 4 1.800 2 0 0 0 0 1 1 [50..75)
4709 26 1 35 90089 2 1.700 2 119 0 0 0 0 1 [25..50)
4713 25 1 122 93022 2 0.200 1 0 0 0 0 1 0 [100..125)
4888 25 1 121 93106 1 5.400 1 158 0 0 0 1 0 [100..125)
4900 26 1 74 90028 4 2.200 1 0 0 0 0 1 0 [50..75)
4973 31 1 68 95045 4 4.000 3 0 0 0 0 1 0 [50..75)
4984 27 1 98 94043 4 2.300 3 0 0 0 0 0 1 [75..100)
In [181]:
print("For Experience 1, the minimum age is",x1_min,
      ",the mean age is",x1_median,
      ",the maximum age is",x1_max)
For Experience 1, the minimum age is 25 ,the mean age is 26.0 ,the maximum age is 31
In [182]:
#experience = 2
ploancp_x2_df = ploancp_df.loc[ploancp_df['Experience']==2]
x2_median=ploancp_x2_df['Age'].median()
x2_min=ploancp_x2_df['Age'].min()
x2_max=ploancp_x2_df['Age'].max()

ploancp_x2_df
Out[182]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Income_bin
58 28 2 93 94065 2 0.200 1 0 0 0 0 0 0 [75..100)
85 27 2 109 94005 4 1.800 3 0 0 0 0 0 0 [100..125)
222 26 2 104 94306 3 2.500 1 0 0 0 0 0 0 [100..125)
277 29 2 30 92126 4 1.000 3 0 0 0 0 0 0 [25..50)
318 27 2 110 95670 4 1.800 3 190 0 0 0 1 0 [100..125)
349 26 2 60 93407 2 3.000 1 132 1 0 0 0 0 [50..75)
397 26 2 48 90503 3 0.700 2 0 0 0 0 1 0 [25..50)
401 29 2 30 95747 4 1.500 2 112 0 0 0 0 1 [25..50)
526 26 2 205 93106 1 6.330 1 271 0 0 0 0 1 [200..225]
533 27 2 101 92807 1 1.900 1 0 0 0 0 0 0 [100..125)
546 27 2 68 94025 3 2.600 3 203 0 1 0 0 0 [50..75)
554 28 2 149 94720 2 7.200 1 0 0 0 0 1 0 [125..150)
608 27 2 55 91910 4 1.700 2 0 0 0 0 1 0 [50..75)
675 29 2 33 91711 1 2.000 2 160 0 0 0 0 0 [25..50)
692 26 2 30 94720 1 1.000 3 111 0 0 0 0 0 [25..50)
770 26 2 172 94551 2 6.900 2 0 1 0 0 1 0 [150..175)
798 29 2 38 93063 1 2.000 2 0 0 0 0 0 0 [25..50)
853 27 2 155 95138 1 0.800 1 0 0 0 0 0 0 [150..175)
864 28 2 10 94080 1 0.100 2 0 0 0 0 1 0 [0..25)
904 28 2 51 90503 4 1.800 2 0 0 1 0 0 0 [50..75)
1059 28 2 11 91203 1 0.100 2 0 0 0 0 1 1 [0..25)
1113 28 2 70 90630 3 0.300 3 0 0 0 0 0 1 [50..75)
1182 28 2 19 94720 4 0.400 1 0 0 0 0 1 1 [0..25)
1213 27 2 78 93943 4 0.200 1 87 0 0 0 0 0 [75..100)
1238 28 2 63 91116 2 1.600 3 0 0 0 0 1 0 [50..75)
1265 32 2 71 95014 2 1.750 3 108 0 0 0 0 0 [50..75)
1275 27 2 92 95616 2 3.100 1 178 0 0 0 1 0 [75..100)
1307 26 2 195 94546 1 6.330 1 0 0 0 0 1 1 [175..200)
1349 26 2 171 93943 3 6.000 2 0 1 0 0 1 0 [150..175)
1350 29 2 29 90266 4 1.500 2 0 0 0 0 0 1 [25..50)
1432 26 2 195 90245 1 6.330 1 0 0 0 0 0 0 [175..200)
1624 28 2 31 90024 2 0.300 2 0 0 1 0 1 0 [25..50)
1722 26 2 72 92647 4 2.600 1 0 0 1 1 1 0 [50..75)
1839 28 2 43 95616 4 1.300 3 0 0 0 0 1 1 [25..50)
1895 26 2 72 95003 4 2.600 1 0 0 0 0 1 0 [50..75)
1931 28 2 140 92122 2 2.000 1 0 0 0 0 1 0 [125..150)
1958 28 2 42 95762 1 1.500 1 0 0 0 0 1 1 [25..50)
1972 28 2 114 94606 4 2.100 3 0 0 0 0 1 0 [100..125)
2000 28 2 22 95670 1 0.100 2 0 0 0 0 1 0 [0..25)
2079 26 2 40 94132 1 1.000 3 0 0 0 0 1 0 [25..50)
2112 27 2 103 93117 1 1.900 1 120 0 0 0 1 0 [100..125)
2123 28 2 9 95014 1 0.100 2 0 0 0 0 1 0 [0..25)
2186 26 2 92 96001 2 0.200 1 0 0 0 0 1 0 [75..100)
2270 26 2 51 92103 4 2.600 1 0 0 0 0 1 0 [50..75)
2304 27 2 170 95818 3 4.700 1 0 1 0 0 1 0 [150..175)
2314 27 2 112 94501 4 1.800 3 0 0 1 0 1 0 [100..125)
2328 27 2 130 92182 3 4.400 1 192 1 0 0 1 0 [125..150)
2387 28 2 51 94720 4 1.800 3 0 0 0 0 0 1 [50..75)
2500 28 2 121 92096 2 2.000 1 341 0 0 0 1 0 [100..125)
2685 28 2 101 90280 4 2.100 3 0 0 0 0 1 0 [100..125)
2807 27 2 129 90009 2 3.300 1 0 0 1 0 0 0 [125..150)
2860 27 2 20 95064 4 0.500 3 0 0 0 0 1 0 [0..25)
2884 28 2 48 93943 4 2.100 3 0 0 0 0 1 1 [25..50)
2897 28 2 34 92161 4 1.300 3 0 0 0 0 0 0 [25..50)
2951 26 2 132 94720 2 2.400 3 0 1 0 0 0 1 [125..150)
3037 27 2 158 95060 3 0.400 2 0 1 0 1 1 0 [150..175)
3040 28 2 33 95814 3 1.000 1 167 0 0 0 1 0 [25..50)
3055 28 2 111 94305 4 2.300 3 0 0 0 0 1 0 [100..125)
3121 28 2 13 91791 4 0.400 1 0 0 0 0 0 1 [0..25)
3228 27 2 45 94305 2 1.700 2 0 0 0 0 0 1 [25..50)
3373 28 2 182 92660 3 7.200 2 442 1 0 1 1 1 [175..200)
3469 26 2 79 95630 2 2.500 1 0 0 0 0 1 0 [75..100)
3494 29 2 31 91330 4 1.500 2 0 0 0 0 0 0 [25..50)
3579 28 2 84 94305 1 2.900 3 102 0 1 1 0 1 [75..100)
3663 26 2 60 94111 4 1.600 1 0 0 0 1 1 1 [50..75)
3751 26 2 12 94591 4 1.000 1 0 0 0 0 1 0 [0..25)
3809 26 2 62 94080 4 1.600 1 0 0 1 0 0 0 [50..75)
3875 26 2 119 95351 2 0.600 1 381 0 0 0 1 1 [100..125)
3885 32 2 69 93943 4 4.000 3 102 0 0 0 1 0 [50..75)
3932 26 2 55 94305 3 0.700 2 0 0 0 0 1 0 [50..75)
4048 27 2 48 90049 2 1.600 3 119 0 1 0 1 0 [25..50)
4085 28 2 53 94609 3 2.400 2 0 0 0 0 1 0 [50..75)
4100 27 2 41 90254 2 1.700 2 0 0 0 0 1 0 [25..50)
4113 28 2 41 93118 3 1.100 2 161 0 0 0 1 0 [25..50)
4185 26 2 82 91950 2 2.500 1 199 0 0 0 0 0 [75..100)
4265 27 2 44 93943 4 0.600 2 0 0 1 1 1 0 [25..50)
4337 26 2 182 93010 2 3.200 2 0 1 0 0 0 0 [175..200)
4362 28 2 55 93940 3 1.100 2 0 0 0 0 0 0 [50..75)
4365 26 2 85 95020 2 2.500 1 0 0 0 0 0 0 [75..100)
4413 29 2 31 91775 4 1.500 2 0 0 0 0 0 1 [25..50)
4508 27 2 85 94117 1 1.900 1 0 0 0 0 1 1 [75..100)
4563 28 2 188 92350 2 4.500 1 0 0 1 0 1 0 [175..200)
4757 26 2 135 94588 1 1.500 1 0 0 0 0 0 0 [125..150)
4769 26 2 20 95064 4 1.000 1 116 0 0 0 0 0 [0..25)
4772 26 2 95 92130 3 0.800 1 0 0 0 0 1 0 [75..100)
In [183]:
print("For Experience 2, the minimum age is",x2_min,
      ",the mean age is",x2_median,
      ",the maximum age is",x2_max)
For Experience 2, the minimum age is 26 ,the mean age is 27.0 ,the maximum age is 32
In [184]:
#experience = 3
ploancp_x3_df = ploancp_df.loc[ploancp_df['Experience']==3]
x3_median=ploancp_x3_df['Age'].median()
x3_min=ploancp_x3_df['Age'].min()
x3_max=ploancp_x3_df['Age'].max()

ploancp_x3_df
Out[184]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Income_bin
74 28 3 135 94611 2 3.300 1 0 0 0 0 0 1 [125..150)
177 29 3 65 94132 4 1.800 2 244 0 0 0 0 0 [50..75)
183 29 3 148 92173 3 4.100 1 0 1 0 0 1 0 [125..150)
198 27 3 59 94123 4 0.000 1 90 0 1 0 1 0 [50..75)
202 30 3 68 94306 4 2.000 2 0 0 0 0 1 0 [50..75)
216 27 3 125 95521 2 0.600 1 0 0 0 0 0 0 [100..125)
239 28 3 52 94112 4 1.700 2 0 0 0 0 0 0 [50..75)
272 29 3 45 95023 4 0.200 1 158 0 0 0 1 1 [25..50)
338 29 3 153 93657 2 2.000 1 392 0 0 0 0 0 [150..175)
399 28 3 84 90024 4 0.200 1 0 0 0 0 1 1 [75..100)
421 28 3 115 92333 4 3.100 2 0 1 0 0 0 0 [100..125)
425 28 3 28 90505 4 0.800 1 0 0 0 0 1 0 [25..50)
457 29 3 69 94303 3 0.300 3 0 0 0 0 0 0 [50..75)
552 28 3 52 90024 4 2.200 1 230 0 0 0 1 0 [50..75)
562 28 3 85 94035 1 0.800 2 0 0 0 0 1 1 [75..100)
581 28 3 55 94521 4 2.200 1 0 0 0 0 1 1 [50..75)
590 29 3 39 94612 4 2.100 3 0 0 0 0 1 0 [25..50)
604 28 3 70 90245 4 2.200 1 240 0 0 0 0 1 [50..75)
607 28 3 170 95014 1 0.100 3 0 1 0 0 0 0 [150..175)
731 28 3 90 90066 2 3.300 1 0 0 0 0 1 1 [75..100)
760 29 3 52 92122 3 1.100 2 0 0 0 0 1 0 [50..75)
789 29 3 31 92126 4 0.300 2 0 0 0 0 1 0 [25..50)
799 29 3 39 95051 4 2.100 3 0 0 0 0 1 0 [25..50)
840 27 3 94 92373 2 0.200 1 310 0 0 0 0 1 [75..100)
878 33 3 74 95616 4 4.000 3 0 0 0 0 0 0 [50..75)
899 30 3 172 91302 3 3.400 2 0 1 0 0 0 1 [150..175)
906 29 3 154 94720 2 2.000 1 130 0 0 0 0 0 [150..175)
931 27 3 43 91302 1 1.000 3 0 0 0 0 1 0 [25..50)
995 28 3 45 94305 2 1.600 3 0 0 0 0 1 1 [25..50)
1009 28 3 25 91330 2 0.900 3 140 0 0 0 1 0 [0..25)
1010 27 3 98 95616 2 2.500 1 361 0 1 1 1 1 [75..100)
1019 29 3 30 91745 4 0.300 2 157 0 0 0 0 0 [25..50)
1022 27 3 118 95605 1 3.300 2 0 1 0 0 1 0 [100..125)
1077 29 3 175 90095 3 3.300 3 329 1 0 0 1 0 [150..175)
1083 28 3 65 95014 3 2.600 3 0 0 1 0 0 0 [50..75)
1093 27 3 40 94550 3 0.100 2 111 0 0 0 1 0 [25..50)
1102 29 3 84 95023 1 2.900 3 0 0 0 0 1 0 [75..100)
1176 29 3 103 90049 4 3.400 1 0 1 0 0 1 0 [100..125)
1177 28 3 71 90405 1 3.300 2 149 1 1 1 1 0 [50..75)
1194 29 3 41 94305 4 1.300 3 0 0 0 0 1 0 [25..50)
1286 29 3 50 94010 3 1.100 2 0 0 0 0 0 1 [25..50)
1316 28 3 51 94086 2 1.600 3 123 0 0 0 0 0 [50..75)
1321 27 3 123 95138 1 5.400 1 0 0 0 0 0 0 [100..125)
1377 27 3 109 93023 2 2.500 1 0 0 0 0 1 0 [100..125)
1386 27 3 72 95616 4 0.000 1 0 0 0 0 1 0 [50..75)
1390 29 3 80 94305 4 1.800 2 0 0 0 0 1 1 [75..100)
1424 29 3 92 94539 2 1.300 1 287 0 0 0 1 0 [75..100)
1437 28 3 123 92007 1 0.800 1 146 0 0 0 0 0 [100..125)
1588 29 3 55 95616 3 1.100 2 0 0 0 0 1 0 [50..75)
1618 29 3 29 94720 3 1.000 1 0 0 0 0 1 1 [25..50)
1642 27 3 84 95814 3 1.500 1 0 0 0 0 1 1 [75..100)
1701 29 3 108 94304 4 1.800 2 0 0 0 0 0 0 [100..125)
1711 27 3 201 95819 1 6.330 1 158 0 0 0 1 0 [200..225]
1744 28 3 29 91105 4 0.800 1 135 0 0 0 1 0 [25..50)
1755 28 3 55 92647 4 1.700 2 0 0 0 0 1 1 [50..75)
1778 27 3 32 94710 3 1.000 2 0 0 0 0 0 0 [25..50)
1785 29 3 190 94080 2 4.500 1 0 0 0 0 1 0 [175..200)
1802 29 3 121 92806 2 1.300 1 0 0 0 0 0 0 [100..125)
1811 28 3 11 94534 4 0.500 3 0 0 0 0 0 0 [0..25)
1875 27 3 112 90066 3 2.500 1 389 0 1 0 1 0 [100..125)
1970 27 3 148 92780 1 1.500 1 397 0 0 0 1 1 [125..150)
1975 29 3 113 94132 2 0.200 1 0 0 0 0 1 1 [100..125)
2022 33 3 71 93561 4 1.800 3 236 0 0 0 0 0 [50..75)
2029 30 3 61 92152 4 2.000 2 0 0 0 0 1 0 [50..75)
2052 28 3 120 94080 1 0.800 1 170 0 0 0 0 0 [100..125)
2059 28 3 173 92121 2 6.700 1 222 0 0 0 1 0 [150..175)
2072 29 3 39 95831 4 0.200 1 137 0 0 0 1 1 [25..50)
2146 27 3 30 93108 1 1.000 3 80 0 0 0 0 0 [25..50)
2147 27 3 20 92007 4 1.000 1 0 0 0 0 0 0 [0..25)
2163 33 3 69 92161 4 1.800 3 0 0 0 0 1 0 [50..75)
2164 27 3 104 92007 2 2.500 1 184 0 1 0 1 0 [100..125)
2190 27 3 110 96150 2 0.200 1 294 0 1 0 0 1 [100..125)
2215 28 3 193 94501 3 4.000 2 0 1 0 0 1 0 [175..200)
2261 30 3 150 94305 4 5.000 2 0 1 0 0 1 0 [125..150)
2268 27 3 105 94304 1 3.000 2 0 1 1 0 0 0 [100..125)
2272 27 3 90 91365 3 0.800 1 0 0 0 0 1 0 [75..100)
2276 29 3 172 92093 4 4.400 1 0 1 0 0 0 0 [150..175)
2296 27 3 82 94305 2 0.200 1 0 0 0 0 0 1 [75..100)
2443 28 3 161 92646 4 1.700 3 422 1 0 1 1 1 [150..175)
2489 29 3 41 92626 4 0.200 1 0 0 0 0 1 0 [25..50)
2492 28 3 134 96091 2 3.100 1 0 0 0 0 1 0 [125..150)
2516 28 3 74 94720 3 2.600 3 0 0 0 0 0 0 [50..75)
2741 29 3 49 90266 1 1.500 1 0 0 0 0 0 0 [25..50)
2789 27 3 34 90065 1 0.200 3 0 0 0 0 1 1 [25..50)
2843 27 3 20 95616 4 1.000 1 134 0 0 0 1 1 [0..25)
2853 28 3 54 94550 4 0.600 2 0 0 0 0 1 0 [50..75)
2918 28 3 142 93727 1 0.800 1 0 0 0 0 1 0 [125..150)
2940 27 3 43 90245 3 0.100 2 163 0 0 0 1 0 [25..50)
2963 29 3 41 94588 1 1.900 3 0 0 0 0 1 1 [25..50)
3012 29 3 172 92373 2 4.500 1 415 0 0 0 1 0 [150..175)
3070 28 3 74 91330 2 1.800 2 221 0 1 0 0 0 [50..75)
3180 27 3 103 92121 2 0.600 1 84 0 0 0 0 0 [100..125)
3201 28 3 81 92121 4 0.200 1 0 0 0 0 0 0 [75..100)
3340 29 3 54 94104 4 1.800 3 0 0 0 0 0 0 [50..75)
3350 28 3 95 90245 2 1.800 2 0 0 0 0 0 0 [75..100)
3389 27 3 88 92182 3 0.800 1 238 0 0 0 0 0 [75..100)
3390 29 3 73 94720 3 0.300 3 0 0 0 0 0 0 [50..75)
3453 29 3 31 94709 4 0.300 2 0 0 0 0 1 0 [25..50)
3463 28 3 149 92121 1 0.800 1 0 0 0 0 1 0 [125..150)
3503 29 3 53 95814 4 2.100 3 0 0 0 0 1 0 [50..75)
3583 30 3 33 95112 4 1.500 2 85 0 0 0 0 0 [25..50)
3592 33 3 20 94704 1 0.670 3 0 0 0 0 0 0 [0..25)
3623 28 3 45 91105 4 1.700 2 95 0 0 0 0 0 [25..50)
3667 27 3 59 94590 4 1.600 1 0 0 0 0 1 0 [50..75)
3728 28 3 118 91902 3 2.400 2 161 1 0 0 0 0 [100..125)
3745 27 3 119 90640 1 5.400 1 118 0 0 0 1 0 [100..125)
3776 27 3 135 93108 3 2.700 3 449 1 0 0 0 1 [125..150)
3914 27 3 35 94080 1 1.800 2 0 0 0 0 0 0 [25..50)
3945 29 3 123 92821 3 5.600 3 428 1 0 0 1 0 [100..125)
3968 28 3 78 93108 4 0.200 1 0 0 0 0 1 1 [75..100)
4042 29 3 190 92612 2 4.500 1 246 0 0 0 1 1 [175..200)
4061 33 3 59 91040 2 1.750 3 0 0 0 0 1 0 [50..75)
4098 27 3 75 90032 4 0.000 1 0 0 0 0 0 0 [50..75)
4129 29 3 10 91320 4 0.400 1 87 0 0 0 1 1 [0..25)
4139 29 3 81 95827 1 2.900 3 0 0 0 0 0 0 [75..100)
4179 29 3 91 94122 1 3.400 3 0 1 0 0 0 0 [75..100)
4274 30 3 79 91380 4 2.000 2 0 0 0 0 1 0 [75..100)
4341 28 3 53 94305 2 1.600 3 0 0 0 0 0 0 [50..75)
4351 30 3 32 94132 1 2.000 2 0 0 0 0 1 1 [25..50)
4370 27 3 18 93524 1 0.400 3 0 0 0 0 0 0 [0..25)
4456 29 3 35 94040 2 0.300 1 88 0 0 1 1 1 [25..50)
4515 29 3 49 94305 4 2.100 3 0 0 0 0 0 0 [25..50)
4663 28 3 115 92407 1 1.900 1 200 0 0 0 1 0 [100..125)
4681 27 3 68 95503 4 0.000 1 0 0 0 0 0 1 [50..75)
4688 29 3 69 92093 4 1.800 2 0 0 0 0 1 1 [50..75)
4714 27 3 81 90291 3 1.500 1 307 0 1 1 1 1 [75..100)
4872 27 3 69 94305 3 0.700 2 0 0 0 0 1 1 [50..75)
4952 29 3 53 94005 4 1.800 3 0 0 0 0 1 0 [50..75)
4995 29 3 40 92697 1 1.900 3 0 0 0 0 1 0 [25..50)
In [185]:
print("For Experience 3, the minimum age is",x3_min,
      ",the mean age is",x3_median,
      ",the maximum age is",x3_max)
For Experience 3, the minimum age is 27 ,the mean age is 28.0 ,the maximum age is 33
In [186]:
#for converted negative experience to positive
#experience = 3
ploancp_NegX_3_df = ploancp_negEx_df.loc[ploancp_negEx_df['allPosX']==3]
NegX3_median=ploancp_NegX_3_df['Age'].median()
NegX3_min=ploancp_NegX_3_df['Age'].min()
NegX3_max=ploancp_NegX_3_df['Age'].max()
print("For Negative Experience -3 to +3, the minimum age is",NegX3_min,
      ",the mean age is",NegX3_median,
      ",the maximum age is",NegX3_max)
For Negative Experience -3 to +3, the minimum age is 23 ,the mean age is 23.5 ,the maximum age is 24
In [187]:
#for converted negative experience to positive
#experience = 2
ploancp_NegX_2_df = ploancp_negEx_df.loc[ploancp_negEx_df['allPosX']==2]
NegX2_median=ploancp_NegX_2_df['Age'].median()
NegX2_min=ploancp_NegX_2_df['Age'].min()
NegX2_max=ploancp_NegX_2_df['Age'].max()
print("For Negative Experience -2 to +2, the minimum age is",NegX2_min,
      ",the mean age is",NegX2_median,
      ",the maximum age is",NegX2_max)
For Negative Experience -2 to +2, the minimum age is 23 ,the mean age is 24.0 ,the maximum age is 28
In [188]:
#for converted negative experience to positive
#experience = 1
ploancp_NegX_1_df = ploancp_negEx_df.loc[ploancp_negEx_df['allPosX']==1]
NegX1_median=ploancp_NegX_1_df['Age'].median()
NegX1_min=ploancp_NegX_1_df['Age'].min()
NegX1_max=ploancp_NegX_1_df['Age'].max()
print("For Negative Experience -1 to +1, the minimum age is",NegX1_min,
      ",the mean age is",NegX1_median,
      ",the maximum age is",NegX1_max)
For Negative Experience -1 to +1, the minimum age is 23 ,the mean age is 25.0 ,the maximum age is 29

Comparing to the case when we converted negative experiences to 0 to the actual positive values:

Actual 1: min: 25, mean: 26, max: 31

Converted 1: min: 23, mean: 25, max: 29

Differences 1-Actual-Converted: min: 2, mean: 1, max: 2


Actual 2: min: 26, mean: 27, max: 32

Converted 2: min: 23, mean: 24, max: 28

Differences 2-Actual-Converted: min: 3, mean: 3, max: 4


Actual 3: min: 27, mean: 28, max: 33

Converted 3: min: 23, mean: 23.5, max: 24

Differences 2-Actual-Converted: min: 4, mean: 4.5, max: 9

Comparing Converting Negative Experience to 0 or Converting to their positive values, we notice that the minimum, median, and maximum ages for the case of Negative Experience converted to 0 are more similar to the actual Experience =0 of the dataset than the minimum, median, and maximum ages if we would convert negative experiences to their respective positive values and compare them to the actual corresponding positive Experience values of the dataset.

Therefore, we choose to convert negative Experience values to 0.

In [189]:
ploancp_df.loc[ploancp_df['Experience'] < 0, 'Experience'] = 0

Experience vs Personal Loan¶

In [190]:
distribution_plot_wrt_target(ploancp_df, "Experience", "Personal_Loan")

Notice that Personal Loan Customers have largest amount of common experience at 5 to 12 years, 18-20 years, and around 30 years. Customers w/o Loans have largest amount of common experiences at 7-8, 18-19, 24-26 years.

Income vs Personal Loan¶

In [191]:
distribution_plot_wrt_target(ploancp_df, "Income", "Personal_Loan")

Personal Loan customers have largest common amount of income 120-140K, and 170-190K. Customers w/o Personal Loans have largest common amount of income 45K.

In [192]:
stacked_barplot(ploancp_df, "Income", "Personal_Loan")
Personal_Loan     0    1   All
Income                        
All            4520  480  5000
130               8   11    19
182               2   11    13
158               8   10    18
135               8   10    18
179               8    9    17
141              15    9    24
154              12    9    21
123               9    9    18
184               3    9    12
142               7    8    15
131              11    8    19
129              15    8    23
172               3    8    11
173               5    8    13
170               4    8    12
180              10    8    18
115              19    8    27
125              16    7    23
164               6    7    13
188               3    7    10
83               67    7    74
114              23    7    30
161               9    7    16
122              17    7    24
133               8    7    15
132              11    7    18
191               6    7    13
134              13    7    20
111              15    7    22
190               4    7    11
145              17    6    23
140              13    6    19
178               4    6    10
118              13    6    19
185               3    6     9
165               5    6    11
168               2    6     8
169               1    6     7
183               6    6    12
120              11    6    17
139              10    6    16
113              29    5    34
119              13    5    18
99               19    5    24
138              13    5    18
155              14    5    19
195              10    5    15
174               4    5     9
175               7    5    12
152              10    5    15
153               7    4    11
181               4    4     8
103              14    4    18
93               33    4    37
108              12    4    16
101              20    4    24
194               4    4     8
192               2    4     6
193               2    4     6
143               5    4     9
149              16    4    20
171               5    4     9
160               8    4    12
159               3    4     7
128              20    4    24
148               7    4    11
162               7    3    10
112              23    3    26
110              16    3    19
124               9    3    12
105              17    3    20
104              17    3    20
102              13    3    16
109              15    3    18
95               22    3    25
150               9    2    11
94               24    2    26
163               7    2     9
91               35    2    37
98               26    2    28
89               32    2    34
121              18    2    20
85               63    2    65
144               5    2     7
65               59    1    60
71               42    1    43
69               45    1    46
100               9    1    10
60               51    1    52
189               1    1     2
73               43    1    44
151               3    1     4
201               4    1     5
64               59    1    60
202               1    1     2
81               82    1    83
92               28    1    29
90               37    1    38
75               46    1    47
82               60    1    61
84               62    1    63
203               1    1     2
33               51    0    51
198               3    0     3
31               55    0    55
30               63    0    63
29               67    0    67
28               63    0    63
25               64    0    64
24               47    0    47
23               54    0    54
22               65    0    65
224               1    0     1
21               65    0    65
20               47    0    47
19               52    0    52
200               3    0     3
218               1    0     1
18               53    0    53
204               3    0     3
15               33    0    33
199               3    0     3
14               31    0    31
13               32    0    32
12               30    0    30
205               2    0     2
11               27    0    27
10               23    0    23
32               58    0    58
48               44    0    44
34               53    0    53
35               65    0    65
9                26    0    26
88               26    0    26
80               56    0    56
79               53    0    53
78               61    0    61
74               45    0    45
72               41    0    41
70               47    0    47
68               35    0    35
63               46    0    46
62               55    0    55
61               57    0    57
59               53    0    53
58               55    0    55
55               61    0    61
54               52    0    52
53               57    0    57
52               47    0    47
51               41    0    41
50               45    0    45
49               52    0    52
45               69    0    69
44               85    0    85
43               70    0    70
42               77    0    77
41               82    0    82
40               78    0    78
39               81    0    81
38               84    0    84
8                23    0    23
------------------------------------------------------------------------------------------------------------------------

Notice that 170K, 182K, 168K, 184K, 172K, 188K, 185K, 192K are the groups that have a largest percentages of Personal Loans within their group.

ZIPCode vs Personal Loan¶

Where are the majority of customers?

In [193]:
#for the number of accounts
ziploan = ploancp_df[['ZIPCode','Personal_Loan']]
ziploan_count = ziploan['ZIPCode'].value_counts()
ziploan_count_df= pd.DataFrame(ziploan_count)
ziploan_count_df.reset_index(inplace=True)
ziploan_count_df = ziploan_count_df.rename(columns={'ZIPCode': 'NoAccounts', 'index': 'ZIPCode'})
ziploan_count_descending_df = ziploan_count_df.sort_values(['NoAccounts'],ascending=False)
ziploan_count_descending_df
Out[193]:
ZIPCode NoAccounts
0 94720 169
1 94305 127
2 95616 116
3 90095 71
4 93106 57
... ... ...
460 92694 1
459 90068 1
457 90813 1
456 94404 1
466 94598 1

467 rows × 2 columns

The majority of customers (10%) are in 5 zip codes: 94720, 94305, 95616, 90095, 93106, all corresponding to states in the West coast: California, Oregon, and Washington.

Lets dig into this a bit more.

Does the ZIP code influence whether a personal loan is taken?

How about comparing the number of loans in a ZIP code as a percentage of the total number of loans in the dataset, or the number of accounts in a ZIP code as a percentage of the total number of accounts in the dataset?

In [194]:
ziploan_count_descending_df.head(10)
Out[194]:
ZIPCode NoAccounts
0 94720 169
1 94305 127
2 95616 116
3 90095 71
4 93106 57
5 93943 54
6 92037 54
7 91320 53
8 91711 52
9 94025 52
In [195]:
#for the number of loans
ziploanpl = ploancp_df[ploancp_df['Personal_Loan']== True]
number_loans = ziploanpl['Personal_Loan'].value_counts()
ziploanpl_count = ziploanpl['ZIPCode'].value_counts()
ziploanpl_count_df = pd.DataFrame(ziploanpl_count)
ziploanpl_count_df.reset_index(inplace=True)
ziploanpl_count_df = ziploanpl_count_df.rename(columns={'ZIPCode': 'PLoans', 'index': 'ZIPCode'})
In [196]:
#calculating the percentages
zipLoanTrue = pd.merge(ziploan_count_df, ziploanpl_count_df, on ='ZIPCode')
zipLoanTrue['PercentLoans'] = (zipLoanTrue['PLoans']/number_loans[1])
zipLoanTrue['PercentAccounts']= (zipLoanTrue['NoAccounts']/number_accounts)

#sort in descending order for number of Loans
zipLoanTrue_descending_df = zipLoanTrue.sort_values(['PLoans'],ascending=False)
zipLoanTrue_descending_df.head(20)
Out[196]:
ZIPCode NoAccounts PLoans PercentLoans PercentAccounts
0 94720 169 19 0.040 0.034
1 94305 127 13 0.027 0.025
10 92093 51 9 0.019 0.010
16 94304 45 8 0.017 0.009
3 90095 71 8 0.017 0.014
13 90089 46 8 0.017 0.009
20 92182 32 7 0.015 0.006
35 94022 25 6 0.013 0.005
2 95616 116 6 0.013 0.023
18 95051 34 6 0.013 0.007
23 95054 31 5 0.010 0.006
24 95814 30 5 0.010 0.006
6 92037 54 5 0.010 0.011
12 90245 50 5 0.010 0.010
73 94928 15 5 0.010 0.003
43 91380 22 5 0.010 0.004
44 92612 22 4 0.008 0.004
36 95060 25 4 0.008 0.005
32 93407 26 4 0.008 0.005
30 95039 26 4 0.008 0.005
In [197]:
#sort in descending order for % of accounts out of total number
zipLoanTrue_descending_df = zipLoanTrue.sort_values(['PercentAccounts'],ascending=False)
zipLoanTrue_descending_df.head(50)
Out[197]:
ZIPCode NoAccounts PLoans PercentLoans PercentAccounts
0 94720 169 19 0.040 0.034
1 94305 127 13 0.027 0.025
2 95616 116 6 0.013 0.023
3 90095 71 8 0.017 0.014
4 93106 57 4 0.008 0.011
5 93943 54 4 0.008 0.011
6 92037 54 5 0.010 0.011
7 91320 53 2 0.004 0.011
8 91711 52 4 0.008 0.010
9 94025 52 4 0.008 0.010
10 92093 51 9 0.019 0.010
11 90024 50 1 0.002 0.010
12 90245 50 5 0.010 0.010
13 90089 46 8 0.017 0.009
14 91330 46 3 0.006 0.009
16 94304 45 8 0.017 0.009
15 92121 45 3 0.006 0.009
17 94143 37 3 0.006 0.007
18 95051 34 6 0.013 0.007
19 94608 34 1 0.002 0.007
20 92182 32 7 0.015 0.006
21 92028 32 4 0.008 0.006
22 92521 32 3 0.006 0.006
23 95054 31 5 0.010 0.006
24 95814 30 5 0.010 0.006
25 95014 29 4 0.008 0.006
26 94542 27 3 0.006 0.005
27 94301 27 2 0.004 0.005
31 95819 26 1 0.002 0.005
32 93407 26 4 0.008 0.005
28 95064 26 3 0.006 0.005
30 95039 26 4 0.008 0.005
29 94501 26 2 0.004 0.005
33 94105 25 1 0.002 0.005
34 91107 25 3 0.006 0.005
35 94022 25 6 0.013 0.005
36 95060 25 4 0.008 0.005
37 94303 25 2 0.004 0.005
38 93117 24 1 0.002 0.005
39 94596 24 4 0.008 0.005
40 93555 23 2 0.004 0.005
41 94080 23 1 0.002 0.005
42 95521 23 1 0.002 0.005
45 92717 22 3 0.006 0.004
43 91380 22 5 0.010 0.004
44 92612 22 4 0.008 0.004
46 92647 21 2 0.004 0.004
47 94110 21 1 0.002 0.004
48 91768 21 2 0.004 0.004
49 90034 20 1 0.002 0.004

We can focus in zip codes that have more than 1 personal loans

In [198]:
zipLoanTrue_descending_loan2_df = zipLoanTrue_descending_df.loc[zipLoanTrue_descending_df['PLoans']>=2]
zipLoanTrue_descending_loan2_df
Out[198]:
ZIPCode NoAccounts PLoans PercentLoans PercentAccounts
0 94720 169 19 0.040 0.034
1 94305 127 13 0.027 0.025
2 95616 116 6 0.013 0.023
3 90095 71 8 0.017 0.014
4 93106 57 4 0.008 0.011
5 93943 54 4 0.008 0.011
6 92037 54 5 0.010 0.011
7 91320 53 2 0.004 0.011
8 91711 52 4 0.008 0.010
9 94025 52 4 0.008 0.010
10 92093 51 9 0.019 0.010
12 90245 50 5 0.010 0.010
13 90089 46 8 0.017 0.009
14 91330 46 3 0.006 0.009
16 94304 45 8 0.017 0.009
15 92121 45 3 0.006 0.009
17 94143 37 3 0.006 0.007
18 95051 34 6 0.013 0.007
20 92182 32 7 0.015 0.006
21 92028 32 4 0.008 0.006
22 92521 32 3 0.006 0.006
23 95054 31 5 0.010 0.006
24 95814 30 5 0.010 0.006
25 95014 29 4 0.008 0.006
26 94542 27 3 0.006 0.005
27 94301 27 2 0.004 0.005
32 93407 26 4 0.008 0.005
28 95064 26 3 0.006 0.005
30 95039 26 4 0.008 0.005
29 94501 26 2 0.004 0.005
34 91107 25 3 0.006 0.005
35 94022 25 6 0.013 0.005
36 95060 25 4 0.008 0.005
37 94303 25 2 0.004 0.005
39 94596 24 4 0.008 0.005
40 93555 23 2 0.004 0.005
45 92717 22 3 0.006 0.004
43 91380 22 5 0.010 0.004
44 92612 22 4 0.008 0.004
46 92647 21 2 0.004 0.004
48 91768 21 2 0.004 0.004
50 92122 19 2 0.004 0.004
51 92697 19 2 0.004 0.004
52 90025 19 2 0.004 0.004
53 95747 19 2 0.004 0.004
57 90291 18 3 0.006 0.004
59 94122 18 2 0.004 0.004
58 93940 18 2 0.004 0.004
56 94309 18 2 0.004 0.004
55 94709 18 2 0.004 0.004
54 90840 18 2 0.004 0.004
63 94583 17 2 0.004 0.003
61 90630 17 2 0.004 0.003
60 93023 17 3 0.006 0.003
68 92677 16 3 0.006 0.003
69 95136 16 2 0.004 0.003
71 92126 16 2 0.004 0.003
78 94061 15 2 0.004 0.003
77 91604 15 2 0.004 0.003
73 94928 15 5 0.010 0.003
79 92152 14 3 0.006 0.003
81 94590 14 2 0.004 0.003
84 90064 14 2 0.004 0.003
86 94111 14 2 0.004 0.003
91 92007 13 4 0.008 0.003
88 94704 13 2 0.004 0.003
94 91355 12 4 0.008 0.002
97 93955 12 4 0.008 0.002
98 93108 12 3 0.006 0.002
100 95818 12 2 0.004 0.002
108 94801 11 2 0.004 0.002
105 95008 11 2 0.004 0.002
103 92008 11 2 0.004 0.002
101 94306 11 2 0.004 0.002
124 93014 10 2 0.004 0.002
123 91360 10 2 0.004 0.002
122 92173 10 2 0.004 0.002
120 94086 10 2 0.004 0.002
119 95032 10 3 0.006 0.002
114 91902 10 2 0.004 0.002
112 92646 10 3 0.006 0.002
111 90405 10 2 0.004 0.002
134 91302 9 3 0.006 0.002
133 94131 9 2 0.004 0.002
132 94102 9 2 0.004 0.002
129 92660 9 2 0.004 0.002
128 91103 9 2 0.004 0.002
127 90049 9 3 0.006 0.002
126 95605 9 2 0.004 0.002
145 93561 8 2 0.004 0.002
144 92626 8 2 0.004 0.002
139 91101 8 2 0.004 0.002
140 90065 8 2 0.004 0.002
138 92672 8 2 0.004 0.002
160 95138 7 2 0.004 0.001
170 92220 7 2 0.004 0.001
164 91423 7 2 0.004 0.001
159 92333 7 2 0.004 0.001
155 94114 7 2 0.004 0.001
171 92056 6 3 0.006 0.001
177 94553 6 2 0.004 0.001
179 90502 6 2 0.004 0.001
202 93022 5 2 0.004 0.001
210 94108 4 2 0.004 0.001
208 90059 4 2 0.004 0.001
207 94705 4 2 0.004 0.001
222 96008 3 2 0.004 0.001
221 95135 3 2 0.004 0.001

ZIP Code 94720 has 19 PL, 94305 has 13 PL, 92093 has 9 PL, 90095, 94304, 90089 have 8 PL, 92182 has 7 PL, 94022, 95051, 95616 have 6 PL, 95054, 95814, 92037, 90245, 94928,91380 have 5 PL

ZIP Code 94720 is the UC Berkeley campus.

ZIP Code 94305 contains the Stanford Univ campus.

ZIP Code 92093 contains the UC Santa Clara and U San Diego campus

ZIP Code 90095: contains UCLA

ZIP Code 94304: Palo Alto, CA close to Stanford

ZIP Code 90089: contains USC or very near to USC

ZIP Code 92182: contains San Diego State Univ.

ZIP Code 94022: this zip code also has many schools, like Palo Alto University, Foothill College, etc.

ZIP Code 95051: Mission College, Santa Clara Univ, etc.

ZIP Code 95616: UC at Davis

ZIP Code 95054: Santa Clara, CA

ZIP Code 95814: Sacramento, CA

ZIP Code 92037: La Jolla, CA

ZIP Code 90245: El Segundo, CA

ZIP Code 94928,

ZIP Code 91380

Is this a clue that college students take out loans, which would mean that customers that are in college zip codes, are more likely than to take a loan than customers that are not in college?

We can later take a look at the age group to see if these match a profile of a student age (beware of non-traditional students). However because many of the ZIP codes are in school areas, there could be some kind of association with the schools (faculty, or be a post-graduate student, or work there for example or even be a non-traditional student). They also seem to be mostly experienced, have large incomes.

Family vs Personal Loan¶

In [199]:
distribution_plot_wrt_target(ploancp_df, "Family", "Personal_Loan")

Notice that customers with 3 and 4 family members have more Personal Loans.

In [200]:
stacked_barplot(ploancp_df, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------

Family size 4 have the largest amount of Personal Loan and the highest percentage within its group. Family size 3 have the second largest percentage within its group.

Credit Card Average Expenses vs Personal Loan¶

Credit Card Average as Year Income Percentage vs Personal Loan

In [201]:
#lets create a new column to represent the % of CCAVg per month over Average monthly Income.
ploancp_df['CCAvg_Inc%']=(ploancp_df['CCAvg']*100/(ploancp_df['Income']/12))
ploancp_df['CCAvg_Inc%']
Out[201]:
0       39.184
1       52.941
2      109.091
3       32.400
4       26.667
         ...  
4995    57.000
4996    32.000
4997    15.000
4998    12.245
4999    11.566
Name: CCAvg_Inc%, Length: 5000, dtype: float64

Credit card usage as a percentage of Income is more meaningful than an absolute value.

In [202]:
ploancp_df['CCAvg_Inc%'].value_counts()
ploancp_df['CCAvg_Inc%'].max()
ploancp_df['CCAvg_Inc%'].min()
Out[202]:
0.0
In [203]:
plt.figure(figsize=(6,4), dpi= 60)
plt.title("Credit Card Average Usage % of Income Distribution ", fontsize=16)
sns.boxplot(data=ploancp_df,x='CCAvg_Inc%')
plt.show()

Notice the outliers above 90% monthly income expense in monthly credit card usage.

So lets drop the credit card Usage CCAvg field.

We should however later take a look in the bivariate analysis, of how Credit card usage looks against Personal Loans.

In [204]:
ploancp_df.drop('CCAvg', axis=1, inplace=True)

Maybe we can create bins to simplify our analysis?

Lets create bins of every 10%

In [205]:
bins = [-0.1, 0, 10, 20, 30,40, 50, 60,70,80, 90, 100, 110, 120, 130, 140, 150]
labels = [0, 10, 20, 30,40, 50, 60,70,80, 90, 100, 110, 120, 130, 140, 150]
ploancp_df['CCAvg_Inc%_bin'] = pd.cut(ploancp_df['CCAvg_Inc%'], bins=bins, labels=labels)
print (ploancp_df)
      Age  Experience  Income ZIPCode  Family Education  Mortgage  \
0      25           1      49   91107       4         1         0   
1      45          19      34   90089       3         1         0   
2      39          15      11   94720       1         1         0   
3      35           9     100   94112       1         2         0   
4      35           8      45   91330       4         2         0   
...   ...         ...     ...     ...     ...       ...       ...   
4995   29           3      40   92697       1         3         0   
4996   30           4      15   92037       4         1        85   
4997   63          39      24   93023       2         3         0   
4998   65          40      49   90034       3         2         0   
4999   28           4      83   92612       3         1         0   

     Personal_Loan Securities_Account CD_Account Online CreditCard Income_bin  \
0                0                  1          0      0          0   [25..50)   
1                0                  1          0      0          0   [25..50)   
2                0                  0          0      0          0    [0..25)   
3                0                  0          0      0          0  [75..100)   
4                0                  0          0      0          1   [25..50)   
...            ...                ...        ...    ...        ...        ...   
4995             0                  0          0      1          0   [25..50)   
4996             0                  0          0      1          0    [0..25)   
4997             0                  0          0      0          0    [0..25)   
4998             0                  0          0      1          0   [25..50)   
4999             0                  0          0      1          1  [75..100)   

      CCAvg_Inc% CCAvg_Inc%_bin  
0         39.184             40  
1         52.941             60  
2        109.091            110  
3         32.400             40  
4         26.667             30  
...          ...            ...  
4995      57.000             60  
4996      32.000             40  
4997      15.000             20  
4998      12.245             20  
4999      11.566             20  

[5000 rows x 15 columns]
In [206]:
ploancp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Age                 5000 non-null   int64   
 1   Experience          5000 non-null   int64   
 2   Income              5000 non-null   int64   
 3   ZIPCode             5000 non-null   object  
 4   Family              5000 non-null   int64   
 5   Education           5000 non-null   object  
 6   Mortgage            5000 non-null   int64   
 7   Personal_Loan       5000 non-null   object  
 8   Securities_Account  5000 non-null   object  
 9   CD_Account          5000 non-null   object  
 10  Online              5000 non-null   object  
 11  CreditCard          5000 non-null   object  
 12  Income_bin          5000 non-null   category
 13  CCAvg_Inc%          5000 non-null   float64 
 14  CCAvg_Inc%_bin      5000 non-null   category
dtypes: category(2), float64(1), int64(5), object(7)
memory usage: 518.7+ KB
In [207]:
distribution_plot_wrt_target(ploancp_df, "CCAvg_Inc%","Personal_Loan")

Credit Card Average as year Income Percentage Bins vs Personal Loan

In [208]:
distribution_plot_wrt_target(ploancp_df, "CCAvg_Inc%_bin","Personal_Loan")
In [209]:
stacked_barplot(ploancp_df, "CCAvg_Inc%_bin", "Personal_Loan")
Personal_Loan      0    1   All
CCAvg_Inc%_bin                 
All             4520  480  5000
50               590  111   701
60               526   88   614
30               785   80   865
40               764   77   841
10               562   60   622
20               727   60   787
70               227    3   230
0                105    1   106
80               146    0   146
90                48    0    48
100               25    0    25
110                7    0     7
120                3    0     3
140                4    0     4
150                1    0     1
------------------------------------------------------------------------------------------------------------------------

Notice that customers with average 40-50% credit card usage as % of their monthly income have more Personal Loans than the rest.

Customers that have 50%, 60% of Credit card expenses as a % of their monthly income, also have the largest % of Personal Loans within their group.

Education vs Personal Loan¶

In [210]:
distribution_plot_wrt_target(ploancp_df, "Education", "Personal_Loan")
In [211]:
stacked_barplot(ploancp_df, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------

On Education, we can see that Customers with Advanced/Professional education take more Personal Loans than customers w/ graduate education. Customers with Graduate education take more loans than customers with undegraduate education.

Assuming that an Advanced Degree/Professional Degree is more education than a Graduate Degree, we can confidently claim that the more Education, the more Loans customers get.

Mortgage vs Personal Loan¶

How about Mortgage as a % of Income?

In [212]:
#lets create a new column to represent the % of Mortage over yearly Income.
ploancp_df['Mortgage_Inc%']=(ploancp_df['Mortgage']*100/ploancp_df['Income'])
ploancp_df['Mortgage_Inc%']
Out[212]:
0        0.000
1        0.000
2        0.000
3        0.000
4        0.000
         ...  
4995     0.000
4996   566.667
4997     0.000
4998     0.000
4999     0.000
Name: Mortgage_Inc%, Length: 5000, dtype: float64
In [213]:
#Mortage Distribution as a % of Yearly Income
plt.figure(figsize=(6,4), dpi= 60)
plt.title("Total Mortgage as % of Yearly Income Distribution", fontsize=16)
sns.boxplot(data=ploancp_df,x='Mortgage_Inc%')
plt.show()

Because Mortgages are spread through long periods of 20-30 years, the % will be very high. However, what if we dare to see the mortgage as a % of yearly income matched to mortgage yearly cost (simple w/o annuities, amortizations, etc)?

According to Google, the average length of a Mortgage is 30 years.

In [214]:
#lets create a new column to represent the % of yearly Mortage over yearly Income.
ploancp_df['year_Mortgage_Inc%']=((ploancp_df['Mortgage']/30)*100/ploancp_df['Income'])
ploancp_df['year_Mortgage_Inc%']
Out[214]:
0       0.000
1       0.000
2       0.000
3       0.000
4       0.000
        ...  
4995    0.000
4996   18.889
4997    0.000
4998    0.000
4999    0.000
Name: year_Mortgage_Inc%, Length: 5000, dtype: float64
In [215]:
#Yearly Mortage Distribution as a % of Yearly Income
plt.figure(figsize=(6,4), dpi= 60)
plt.title("Yearly Mortgage Amortization as % Yearly Income Distribution", fontsize=16)
sns.boxplot(data=ploancp_df,x='year_Mortgage_Inc%')
plt.show()

Mortgage as Yearly Income Percentage vs Personal Loan

In [216]:
distribution_plot_wrt_target(ploancp_df, "year_Mortgage_Inc%","Personal_Loan")

Mortgage as Yearly Income Percentage Bins vs Personal Loan

In [217]:
bins = [-0.1,0, 5, 10, 15, 20,25, 30, 35,40,45]
labels = [0, 5, 10, 15, 20,25, 30, 35,40,45]
ploancp_df['year_Mortage_Inc%_bin'] = pd.cut(ploancp_df['year_Mortgage_Inc%'], bins=bins, labels=labels)
print (ploancp_df)
      Age  Experience  Income ZIPCode  Family Education  Mortgage  \
0      25           1      49   91107       4         1         0   
1      45          19      34   90089       3         1         0   
2      39          15      11   94720       1         1         0   
3      35           9     100   94112       1         2         0   
4      35           8      45   91330       4         2         0   
...   ...         ...     ...     ...     ...       ...       ...   
4995   29           3      40   92697       1         3         0   
4996   30           4      15   92037       4         1        85   
4997   63          39      24   93023       2         3         0   
4998   65          40      49   90034       3         2         0   
4999   28           4      83   92612       3         1         0   

     Personal_Loan Securities_Account CD_Account Online CreditCard Income_bin  \
0                0                  1          0      0          0   [25..50)   
1                0                  1          0      0          0   [25..50)   
2                0                  0          0      0          0    [0..25)   
3                0                  0          0      0          0  [75..100)   
4                0                  0          0      0          1   [25..50)   
...            ...                ...        ...    ...        ...        ...   
4995             0                  0          0      1          0   [25..50)   
4996             0                  0          0      1          0    [0..25)   
4997             0                  0          0      0          0    [0..25)   
4998             0                  0          0      1          0   [25..50)   
4999             0                  0          0      1          1  [75..100)   

      CCAvg_Inc% CCAvg_Inc%_bin  Mortgage_Inc%  year_Mortgage_Inc%  \
0         39.184             40          0.000               0.000   
1         52.941             60          0.000               0.000   
2        109.091            110          0.000               0.000   
3         32.400             40          0.000               0.000   
4         26.667             30          0.000               0.000   
...          ...            ...            ...                 ...   
4995      57.000             60          0.000               0.000   
4996      32.000             40        566.667              18.889   
4997      15.000             20          0.000               0.000   
4998      12.245             20          0.000               0.000   
4999      11.566             20          0.000               0.000   

     year_Mortage_Inc%_bin  
0                        0  
1                        0  
2                        0  
3                        0  
4                        0  
...                    ...  
4995                     0  
4996                    20  
4997                     0  
4998                     0  
4999                     0  

[5000 rows x 18 columns]
In [218]:
#ploancp_df['year_Mortage_Inc%_bin']
ploancp_df['year_Mortage_Inc%_bin'] = ploancp_df['year_Mortage_Inc%_bin'].astype('int')
In [219]:
distribution_plot_wrt_target(ploancp_df, "year_Mortage_Inc%_bin","Personal_Loan")
In [220]:
sns.countplot(ploancp_df["year_Mortage_Inc%_bin"])
plt.title("Mortgage Distribution as Yearly Income % - 5% bins")
plt.show()

We notice that there is a customer that pays more than 40% of his/her yearly income in mortage. But we will analyze this in the bivariate analysis better.

Now Lets drop the columns 'Mortgage' and 'Mortgage_Inc%"

In [221]:
ploancp_df.drop(['Mortgage','Mortgage_Inc%'], axis=1, inplace=True)
In [222]:
#lets look at the mortage
ploancp_df['year_Mortgage_Inc%'].value_counts()
Out[222]:
0.000     3462
10.000       5
14.545       5
11.429       4
4.667        4
          ... 
5.805        1
10.000       1
24.524       1
10.397       1
18.889       1
Name: year_Mortgage_Inc%, Length: 1392, dtype: int64
In [223]:
ploancp_df['year_Mortgage_Inc%'].max()
ploancp_df['year_Mortgage_Inc%'].min()
Out[223]:
0.0

lets bin the mortages

In [224]:
stacked_barplot(ploancp_df, "year_Mortage_Inc%_bin", "Personal_Loan")
Personal_Loan             0    1   All
year_Mortage_Inc%_bin                 
All                    4520  480  5000
0                      3150  312  3462
10                      504   77   581
5                       149   62   211
15                      484   29   513
20                      143    0   143
25                       43    0    43
30                       26    0    26
35                       12    0    12
40                        7    0     7
45                        2    0     2
------------------------------------------------------------------------------------------------------------------------

Notice that customers that have mortgages and that mortage is 5%-10% of their yearly income have the largest % of personal loans

Securities Account vs Personal Loan¶

In [225]:
distribution_plot_wrt_target(ploancp_df, "Securities_Account", "Personal_Loan")
In [226]:
stacked_barplot(ploancp_df, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------

Customers with Securities Account have a slightly largest % of personal Loans within their group.

CD Account vs Personal Loan¶

In [227]:
distribution_plot_wrt_target(ploancp_df, "CD_Account", "Personal_Loan")
In [228]:
stacked_barplot(ploancp_df, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------

Customers with CD Accounts also tend to have more Personal Loans % within their group than customers w/o a CD Account.

Online vs Personal Loan¶

In [229]:
distribution_plot_wrt_target(ploancp_df, "Online", "Personal_Loan")
In [230]:
stacked_barplot(ploancp_df, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------

Customers that use Online access and customers that don't use Online access are very similar in the amount of Personal Loans.

Credit Card vs Personal Loan¶

In [231]:
distribution_plot_wrt_target(ploancp_df, "CreditCard", "Personal_Loan")
In [232]:
stacked_barplot(ploancp_df, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------

There are more customers w/o a Credit Card that have a Personal Loans than those with a credit card. However, percentage wise within their group, a larger % within their group of Customers that have Credit cards than those w/o credit cards.

Second Data Preparation¶

In [233]:
ploancp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   Age                    5000 non-null   int64   
 1   Experience             5000 non-null   int64   
 2   Income                 5000 non-null   int64   
 3   ZIPCode                5000 non-null   object  
 4   Family                 5000 non-null   int64   
 5   Education              5000 non-null   object  
 6   Personal_Loan          5000 non-null   object  
 7   Securities_Account     5000 non-null   object  
 8   CD_Account             5000 non-null   object  
 9   Online                 5000 non-null   object  
 10  CreditCard             5000 non-null   object  
 11  Income_bin             5000 non-null   category
 12  CCAvg_Inc%             5000 non-null   float64 
 13  CCAvg_Inc%_bin         5000 non-null   category
 14  year_Mortgage_Inc%     5000 non-null   float64 
 15  year_Mortage_Inc%_bin  5000 non-null   int64   
dtypes: category(2), float64(2), int64(5), object(7)
memory usage: 557.8+ KB

We can drop year_Mortage_Inc%_bin, CCAvg_Inc%_bin, Income_bin.

In [234]:
ploancp_df.drop(['year_Mortage_Inc%_bin', 'CCAvg_Inc%_bin', 'Income_bin'], axis=1,inplace=True)
In [235]:
ploancp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIPCode             5000 non-null   object 
 4   Family              5000 non-null   int64  
 5   Education           5000 non-null   object 
 6   Personal_Loan       5000 non-null   object 
 7   Securities_Account  5000 non-null   object 
 8   CD_Account          5000 non-null   object 
 9   Online              5000 non-null   object 
 10  CreditCard          5000 non-null   object 
 11  CCAvg_Inc%          5000 non-null   float64
 12  year_Mortgage_Inc%  5000 non-null   float64
dtypes: float64(2), int64(4), object(7)
memory usage: 507.9+ KB
In [236]:
numerical_col = ploancp_df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 30))

for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(ploancp_df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [237]:
ploancp_df.dtypes
Out[237]:
Age                     int64
Experience              int64
Income                  int64
ZIPCode                object
Family                  int64
Education              object
Personal_Loan          object
Securities_Account     object
CD_Account             object
Online                 object
CreditCard             object
CCAvg_Inc%            float64
year_Mortgage_Inc%    float64
dtype: object

Logistic Regression Modelling¶

Should we hot-encode or dummy any features?

  1. Personal Loan, Securities Account, CD_Account, Online, CreditCard are ok as they are since they are either a 1 for Yes, and 0 for No.

  2. The big issue is ZIPCodes. Although there are over 400 zipcodes, how else could we introduce ZIPcodes? Will we be overfitting the model? Should we stick to dimension reduction by binning zip codes?

However, if we bin zip codes, like for example, regions or states, we may lose data.

We will aim to hot encode zip codes.

First, lets convert our Objects to integers 0 or 1 w/o considering ZIPCode for the time being.

In [238]:
cols_num2cat = ['Education','Personal_Loan','Securities_Account','CD_Account','Online','CreditCard','CCAvg_Inc%','year_Mortgage_Inc%']
ploancp_df[cols_num2cat] = ploancp_df[cols_num2cat].astype('int')

Lets take a look at the correlation between numerical features.

In [239]:
fig, ax = plt.subplots(figsize=(12,10))
ax=sns.heatmap(ploancp_df.corr(), cmap="YlGnBu", annot=True)

Correlation Levels for Features vs Personal Loan Medium Correlation : Income and Personal Loan (0.5) Medium Low Correlation: CD_Account and Personal Loan (0.32) Low Correlation : Education and Personal Loan (0.14)

Now lets hot-encode (dummy) the ZIP Codes.

In [240]:
ploancp_df= pd.get_dummies(ploancp_df, columns = ['ZIPCode'])
In [241]:
ploancp_df.head()
Out[241]:
Age Experience Income Family Education Personal_Loan Securities_Account CD_Account Online CreditCard CCAvg_Inc% year_Mortgage_Inc% ZIPCode_90005 ZIPCode_90007 ZIPCode_90009 ZIPCode_90011 ZIPCode_90016 ZIPCode_90018 ZIPCode_90019 ZIPCode_90024 ZIPCode_90025 ZIPCode_90027 ZIPCode_90028 ZIPCode_90029 ZIPCode_90032 ZIPCode_90033 ZIPCode_90034 ZIPCode_90035 ZIPCode_90036 ZIPCode_90037 ZIPCode_90041 ZIPCode_90044 ZIPCode_90045 ZIPCode_90048 ZIPCode_90049 ZIPCode_90057 ZIPCode_90058 ZIPCode_90059 ZIPCode_90064 ZIPCode_90065 ZIPCode_90066 ZIPCode_90068 ZIPCode_90071 ZIPCode_90073 ZIPCode_90086 ZIPCode_90089 ZIPCode_90095 ZIPCode_90210 ZIPCode_90212 ZIPCode_90230 ZIPCode_90232 ZIPCode_90245 ZIPCode_90250 ZIPCode_90254 ZIPCode_90266 ZIPCode_90272 ZIPCode_90274 ZIPCode_90275 ZIPCode_90277 ZIPCode_90280 ZIPCode_90291 ZIPCode_90304 ZIPCode_90401 ZIPCode_90404 ZIPCode_90405 ZIPCode_90502 ZIPCode_90503 ZIPCode_90504 ZIPCode_90505 ZIPCode_90509 ZIPCode_90601 ZIPCode_90623 ZIPCode_90630 ZIPCode_90638 ZIPCode_90639 ZIPCode_90640 ZIPCode_90650 ZIPCode_90717 ZIPCode_90720 ZIPCode_90740 ZIPCode_90745 ZIPCode_90747 ZIPCode_90755 ZIPCode_90813 ZIPCode_90840 ZIPCode_91006 ZIPCode_91007 ZIPCode_91016 ZIPCode_91024 ZIPCode_91030 ZIPCode_91040 ZIPCode_91101 ZIPCode_91103 ZIPCode_91105 ZIPCode_91107 ZIPCode_91109 ZIPCode_91116 ZIPCode_91125 ZIPCode_91129 ZIPCode_91203 ZIPCode_91207 ZIPCode_91301 ZIPCode_91302 ZIPCode_91304 ZIPCode_91311 ZIPCode_91320 ZIPCode_91326 ZIPCode_91330 ZIPCode_91335 ZIPCode_91342 ZIPCode_91343 ZIPCode_91345 ZIPCode_91355 ZIPCode_91360 ZIPCode_91361 ZIPCode_91365 ZIPCode_91367 ZIPCode_91380 ZIPCode_91401 ZIPCode_91423 ZIPCode_91604 ZIPCode_91605 ZIPCode_91614 ZIPCode_91706 ZIPCode_91709 ZIPCode_91710 ZIPCode_91711 ZIPCode_91730 ZIPCode_91741 ZIPCode_91745 ZIPCode_91754 ZIPCode_91763 ZIPCode_91765 ZIPCode_91768 ZIPCode_91770 ZIPCode_91773 ZIPCode_91775 ZIPCode_91784 ZIPCode_91791 ZIPCode_91801 ZIPCode_91902 ZIPCode_91910 ZIPCode_91911 ZIPCode_91941 ZIPCode_91942 ZIPCode_91950 ZIPCode_92007 ZIPCode_92008 ZIPCode_92009 ZIPCode_92024 ZIPCode_92028 ZIPCode_92029 ZIPCode_92037 ZIPCode_92038 ZIPCode_92054 ZIPCode_92056 ZIPCode_92064 ZIPCode_92068 ZIPCode_92069 ZIPCode_92084 ZIPCode_92093 ZIPCode_92096 ZIPCode_92101 ZIPCode_92103 ZIPCode_92104 ZIPCode_92106 ZIPCode_92109 ZIPCode_92110 ZIPCode_92115 ZIPCode_92116 ZIPCode_92120 ZIPCode_92121 ZIPCode_92122 ZIPCode_92123 ZIPCode_92124 ZIPCode_92126 ZIPCode_92129 ZIPCode_92130 ZIPCode_92131 ZIPCode_92152 ZIPCode_92154 ZIPCode_92161 ZIPCode_92173 ZIPCode_92177 ZIPCode_92182 ZIPCode_92192 ZIPCode_92220 ZIPCode_92251 ZIPCode_92325 ZIPCode_92333 ZIPCode_92346 ZIPCode_92350 ZIPCode_92354 ZIPCode_92373 ZIPCode_92374 ZIPCode_92399 ZIPCode_92407 ZIPCode_92507 ZIPCode_92518 ZIPCode_92521 ZIPCode_92606 ZIPCode_92612 ZIPCode_92614 ZIPCode_92624 ZIPCode_92626 ZIPCode_92630 ZIPCode_92634 ZIPCode_92646 ZIPCode_92647 ZIPCode_92648 ZIPCode_92653 ZIPCode_92660 ZIPCode_92661 ZIPCode_92672 ZIPCode_92673 ZIPCode_92675 ZIPCode_92677 ZIPCode_92691 ZIPCode_92692 ZIPCode_92694 ZIPCode_92697 ZIPCode_92703 ZIPCode_92704 ZIPCode_92705 ZIPCode_92709 ZIPCode_92717 ZIPCode_92735 ZIPCode_92780 ZIPCode_92806 ZIPCode_92807 ZIPCode_92821 ZIPCode_92831 ZIPCode_92833 ZIPCode_92834 ZIPCode_92835 ZIPCode_92843 ZIPCode_92866 ZIPCode_92867 ZIPCode_92868 ZIPCode_92870 ZIPCode_92886 ZIPCode_93003 ZIPCode_93009 ZIPCode_93010 ZIPCode_93014 ZIPCode_93022 ZIPCode_93023 ZIPCode_93033 ZIPCode_93063 ZIPCode_93065 ZIPCode_93077 ZIPCode_93101 ZIPCode_93105 ZIPCode_93106 ZIPCode_93107 ZIPCode_93108 ZIPCode_93109 ZIPCode_93111 ZIPCode_93117 ZIPCode_93118 ZIPCode_93302 ZIPCode_93305 ZIPCode_93311 ZIPCode_93401 ZIPCode_93403 ZIPCode_93407 ZIPCode_93437 ZIPCode_93460 ZIPCode_93524 ZIPCode_93555 ZIPCode_93561 ZIPCode_93611 ZIPCode_93657 ZIPCode_93711 ZIPCode_93720 ZIPCode_93727 ZIPCode_93907 ZIPCode_93933 ZIPCode_93940 ZIPCode_93943 ZIPCode_93950 ZIPCode_93955 ZIPCode_94002 ZIPCode_94005 ZIPCode_94010 ZIPCode_94015 ZIPCode_94019 ZIPCode_94022 ZIPCode_94024 ZIPCode_94025 ZIPCode_94028 ZIPCode_94035 ZIPCode_94040 ZIPCode_94043 ZIPCode_94061 ZIPCode_94063 ZIPCode_94065 ZIPCode_94066 ZIPCode_94080 ZIPCode_94085 ZIPCode_94086 ZIPCode_94087 ZIPCode_94102 ZIPCode_94104 ZIPCode_94105 ZIPCode_94107 ZIPCode_94108 ZIPCode_94109 ZIPCode_94110 ZIPCode_94111 ZIPCode_94112 ZIPCode_94114 ZIPCode_94115 ZIPCode_94116 ZIPCode_94117 ZIPCode_94118 ZIPCode_94122 ZIPCode_94123 ZIPCode_94124 ZIPCode_94126 ZIPCode_94131 ZIPCode_94132 ZIPCode_94143 ZIPCode_94234 ZIPCode_94301 ZIPCode_94302 ZIPCode_94303 ZIPCode_94304 ZIPCode_94305 ZIPCode_94306 ZIPCode_94309 ZIPCode_94402 ZIPCode_94404 ZIPCode_94501 ZIPCode_94507 ZIPCode_94509 ZIPCode_94521 ZIPCode_94523 ZIPCode_94526 ZIPCode_94534 ZIPCode_94536 ZIPCode_94538 ZIPCode_94539 ZIPCode_94542 ZIPCode_94545 ZIPCode_94546 ZIPCode_94550 ZIPCode_94551 ZIPCode_94553 ZIPCode_94555 ZIPCode_94558 ZIPCode_94566 ZIPCode_94571 ZIPCode_94575 ZIPCode_94577 ZIPCode_94583 ZIPCode_94588 ZIPCode_94590 ZIPCode_94591 ZIPCode_94596 ZIPCode_94598 ZIPCode_94604 ZIPCode_94606 ZIPCode_94607 ZIPCode_94608 ZIPCode_94609 ZIPCode_94610 ZIPCode_94611 ZIPCode_94612 ZIPCode_94618 ZIPCode_94701 ZIPCode_94703 ZIPCode_94704 ZIPCode_94705 ZIPCode_94706 ZIPCode_94707 ZIPCode_94708 ZIPCode_94709 ZIPCode_94710 ZIPCode_94720 ZIPCode_94801 ZIPCode_94803 ZIPCode_94806 ZIPCode_94901 ZIPCode_94904 ZIPCode_94920 ZIPCode_94923 ZIPCode_94928 ZIPCode_94939 ZIPCode_94949 ZIPCode_94960 ZIPCode_94965 ZIPCode_94970 ZIPCode_94998 ZIPCode_95003 ZIPCode_95005 ZIPCode_95006 ZIPCode_95008 ZIPCode_95010 ZIPCode_95014 ZIPCode_95020 ZIPCode_95023 ZIPCode_95032 ZIPCode_95035 ZIPCode_95037 ZIPCode_95039 ZIPCode_95045 ZIPCode_95051 ZIPCode_95053 ZIPCode_95054 ZIPCode_95060 ZIPCode_95064 ZIPCode_95070 ZIPCode_95112 ZIPCode_95120 ZIPCode_95123 ZIPCode_95125 ZIPCode_95126 ZIPCode_95131 ZIPCode_95133 ZIPCode_95134 ZIPCode_95135 ZIPCode_95136 ZIPCode_95138 ZIPCode_95192 ZIPCode_95193 ZIPCode_95207 ZIPCode_95211 ZIPCode_95307 ZIPCode_95348 ZIPCode_95351 ZIPCode_95354 ZIPCode_95370 ZIPCode_95403 ZIPCode_95405 ZIPCode_95422 ZIPCode_95449 ZIPCode_95482 ZIPCode_95503 ZIPCode_95518 ZIPCode_95521 ZIPCode_95605 ZIPCode_95616 ZIPCode_95617 ZIPCode_95621 ZIPCode_95630 ZIPCode_95670 ZIPCode_95678 ZIPCode_95741 ZIPCode_95747 ZIPCode_95758 ZIPCode_95762 ZIPCode_95812 ZIPCode_95814 ZIPCode_95816 ZIPCode_95817 ZIPCode_95818 ZIPCode_95819 ZIPCode_95820 ZIPCode_95821 ZIPCode_95822 ZIPCode_95825 ZIPCode_95827 ZIPCode_95828 ZIPCode_95831 ZIPCode_95833 ZIPCode_95841 ZIPCode_95842 ZIPCode_95929 ZIPCode_95973 ZIPCode_96001 ZIPCode_96003 ZIPCode_96008 ZIPCode_96064 ZIPCode_96091 ZIPCode_96094 ZIPCode_96145 ZIPCode_96150 ZIPCode_96651
0 25 1 49 4 1 0 1 0 0 0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 45 19 34 3 1 0 1 0 0 0 52 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 39 15 11 1 1 0 0 0 0 0 109 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 35 9 100 1 2 0 0 0 0 0 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 35 8 45 4 2 0 0 0 0 1 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [242]:
ploancp_df.dtypes
Out[242]:
Age              int64
Experience       int64
Income           int64
Family           int64
Education        int64
                 ...  
ZIPCode_96091    uint8
ZIPCode_96094    uint8
ZIPCode_96145    uint8
ZIPCode_96150    uint8
ZIPCode_96651    uint8
Length: 479, dtype: object

Lets drop the Target Feature (Personal Loan)

In [243]:
X = ploancp_df.drop(["Personal_Loan"], axis=1)
Y = ploancp_df["Personal_Loan"]

X = pd.get_dummies(X, drop_first=True)
In [244]:
X
Out[244]:
Age Experience Income Family Education Securities_Account CD_Account Online CreditCard CCAvg_Inc% year_Mortgage_Inc% ZIPCode_90005 ZIPCode_90007 ZIPCode_90009 ZIPCode_90011 ZIPCode_90016 ZIPCode_90018 ZIPCode_90019 ZIPCode_90024 ZIPCode_90025 ZIPCode_90027 ZIPCode_90028 ZIPCode_90029 ZIPCode_90032 ZIPCode_90033 ZIPCode_90034 ZIPCode_90035 ZIPCode_90036 ZIPCode_90037 ZIPCode_90041 ZIPCode_90044 ZIPCode_90045 ZIPCode_90048 ZIPCode_90049 ZIPCode_90057 ZIPCode_90058 ZIPCode_90059 ZIPCode_90064 ZIPCode_90065 ZIPCode_90066 ZIPCode_90068 ZIPCode_90071 ZIPCode_90073 ZIPCode_90086 ZIPCode_90089 ZIPCode_90095 ZIPCode_90210 ZIPCode_90212 ZIPCode_90230 ZIPCode_90232 ZIPCode_90245 ZIPCode_90250 ZIPCode_90254 ZIPCode_90266 ZIPCode_90272 ZIPCode_90274 ZIPCode_90275 ZIPCode_90277 ZIPCode_90280 ZIPCode_90291 ZIPCode_90304 ZIPCode_90401 ZIPCode_90404 ZIPCode_90405 ZIPCode_90502 ZIPCode_90503 ZIPCode_90504 ZIPCode_90505 ZIPCode_90509 ZIPCode_90601 ZIPCode_90623 ZIPCode_90630 ZIPCode_90638 ZIPCode_90639 ZIPCode_90640 ZIPCode_90650 ZIPCode_90717 ZIPCode_90720 ZIPCode_90740 ZIPCode_90745 ZIPCode_90747 ZIPCode_90755 ZIPCode_90813 ZIPCode_90840 ZIPCode_91006 ZIPCode_91007 ZIPCode_91016 ZIPCode_91024 ZIPCode_91030 ZIPCode_91040 ZIPCode_91101 ZIPCode_91103 ZIPCode_91105 ZIPCode_91107 ZIPCode_91109 ZIPCode_91116 ZIPCode_91125 ZIPCode_91129 ZIPCode_91203 ZIPCode_91207 ZIPCode_91301 ZIPCode_91302 ZIPCode_91304 ZIPCode_91311 ZIPCode_91320 ZIPCode_91326 ZIPCode_91330 ZIPCode_91335 ZIPCode_91342 ZIPCode_91343 ZIPCode_91345 ZIPCode_91355 ZIPCode_91360 ZIPCode_91361 ZIPCode_91365 ZIPCode_91367 ZIPCode_91380 ZIPCode_91401 ZIPCode_91423 ZIPCode_91604 ZIPCode_91605 ZIPCode_91614 ZIPCode_91706 ZIPCode_91709 ZIPCode_91710 ZIPCode_91711 ZIPCode_91730 ZIPCode_91741 ZIPCode_91745 ZIPCode_91754 ZIPCode_91763 ZIPCode_91765 ZIPCode_91768 ZIPCode_91770 ZIPCode_91773 ZIPCode_91775 ZIPCode_91784 ZIPCode_91791 ZIPCode_91801 ZIPCode_91902 ZIPCode_91910 ZIPCode_91911 ZIPCode_91941 ZIPCode_91942 ZIPCode_91950 ZIPCode_92007 ZIPCode_92008 ZIPCode_92009 ZIPCode_92024 ZIPCode_92028 ZIPCode_92029 ZIPCode_92037 ZIPCode_92038 ZIPCode_92054 ZIPCode_92056 ZIPCode_92064 ZIPCode_92068 ZIPCode_92069 ZIPCode_92084 ZIPCode_92093 ZIPCode_92096 ZIPCode_92101 ZIPCode_92103 ZIPCode_92104 ZIPCode_92106 ZIPCode_92109 ZIPCode_92110 ZIPCode_92115 ZIPCode_92116 ZIPCode_92120 ZIPCode_92121 ZIPCode_92122 ZIPCode_92123 ZIPCode_92124 ZIPCode_92126 ZIPCode_92129 ZIPCode_92130 ZIPCode_92131 ZIPCode_92152 ZIPCode_92154 ZIPCode_92161 ZIPCode_92173 ZIPCode_92177 ZIPCode_92182 ZIPCode_92192 ZIPCode_92220 ZIPCode_92251 ZIPCode_92325 ZIPCode_92333 ZIPCode_92346 ZIPCode_92350 ZIPCode_92354 ZIPCode_92373 ZIPCode_92374 ZIPCode_92399 ZIPCode_92407 ZIPCode_92507 ZIPCode_92518 ZIPCode_92521 ZIPCode_92606 ZIPCode_92612 ZIPCode_92614 ZIPCode_92624 ZIPCode_92626 ZIPCode_92630 ZIPCode_92634 ZIPCode_92646 ZIPCode_92647 ZIPCode_92648 ZIPCode_92653 ZIPCode_92660 ZIPCode_92661 ZIPCode_92672 ZIPCode_92673 ZIPCode_92675 ZIPCode_92677 ZIPCode_92691 ZIPCode_92692 ZIPCode_92694 ZIPCode_92697 ZIPCode_92703 ZIPCode_92704 ZIPCode_92705 ZIPCode_92709 ZIPCode_92717 ZIPCode_92735 ZIPCode_92780 ZIPCode_92806 ZIPCode_92807 ZIPCode_92821 ZIPCode_92831 ZIPCode_92833 ZIPCode_92834 ZIPCode_92835 ZIPCode_92843 ZIPCode_92866 ZIPCode_92867 ZIPCode_92868 ZIPCode_92870 ZIPCode_92886 ZIPCode_93003 ZIPCode_93009 ZIPCode_93010 ZIPCode_93014 ZIPCode_93022 ZIPCode_93023 ZIPCode_93033 ZIPCode_93063 ZIPCode_93065 ZIPCode_93077 ZIPCode_93101 ZIPCode_93105 ZIPCode_93106 ZIPCode_93107 ZIPCode_93108 ZIPCode_93109 ZIPCode_93111 ZIPCode_93117 ZIPCode_93118 ZIPCode_93302 ZIPCode_93305 ZIPCode_93311 ZIPCode_93401 ZIPCode_93403 ZIPCode_93407 ZIPCode_93437 ZIPCode_93460 ZIPCode_93524 ZIPCode_93555 ZIPCode_93561 ZIPCode_93611 ZIPCode_93657 ZIPCode_93711 ZIPCode_93720 ZIPCode_93727 ZIPCode_93907 ZIPCode_93933 ZIPCode_93940 ZIPCode_93943 ZIPCode_93950 ZIPCode_93955 ZIPCode_94002 ZIPCode_94005 ZIPCode_94010 ZIPCode_94015 ZIPCode_94019 ZIPCode_94022 ZIPCode_94024 ZIPCode_94025 ZIPCode_94028 ZIPCode_94035 ZIPCode_94040 ZIPCode_94043 ZIPCode_94061 ZIPCode_94063 ZIPCode_94065 ZIPCode_94066 ZIPCode_94080 ZIPCode_94085 ZIPCode_94086 ZIPCode_94087 ZIPCode_94102 ZIPCode_94104 ZIPCode_94105 ZIPCode_94107 ZIPCode_94108 ZIPCode_94109 ZIPCode_94110 ZIPCode_94111 ZIPCode_94112 ZIPCode_94114 ZIPCode_94115 ZIPCode_94116 ZIPCode_94117 ZIPCode_94118 ZIPCode_94122 ZIPCode_94123 ZIPCode_94124 ZIPCode_94126 ZIPCode_94131 ZIPCode_94132 ZIPCode_94143 ZIPCode_94234 ZIPCode_94301 ZIPCode_94302 ZIPCode_94303 ZIPCode_94304 ZIPCode_94305 ZIPCode_94306 ZIPCode_94309 ZIPCode_94402 ZIPCode_94404 ZIPCode_94501 ZIPCode_94507 ZIPCode_94509 ZIPCode_94521 ZIPCode_94523 ZIPCode_94526 ZIPCode_94534 ZIPCode_94536 ZIPCode_94538 ZIPCode_94539 ZIPCode_94542 ZIPCode_94545 ZIPCode_94546 ZIPCode_94550 ZIPCode_94551 ZIPCode_94553 ZIPCode_94555 ZIPCode_94558 ZIPCode_94566 ZIPCode_94571 ZIPCode_94575 ZIPCode_94577 ZIPCode_94583 ZIPCode_94588 ZIPCode_94590 ZIPCode_94591 ZIPCode_94596 ZIPCode_94598 ZIPCode_94604 ZIPCode_94606 ZIPCode_94607 ZIPCode_94608 ZIPCode_94609 ZIPCode_94610 ZIPCode_94611 ZIPCode_94612 ZIPCode_94618 ZIPCode_94701 ZIPCode_94703 ZIPCode_94704 ZIPCode_94705 ZIPCode_94706 ZIPCode_94707 ZIPCode_94708 ZIPCode_94709 ZIPCode_94710 ZIPCode_94720 ZIPCode_94801 ZIPCode_94803 ZIPCode_94806 ZIPCode_94901 ZIPCode_94904 ZIPCode_94920 ZIPCode_94923 ZIPCode_94928 ZIPCode_94939 ZIPCode_94949 ZIPCode_94960 ZIPCode_94965 ZIPCode_94970 ZIPCode_94998 ZIPCode_95003 ZIPCode_95005 ZIPCode_95006 ZIPCode_95008 ZIPCode_95010 ZIPCode_95014 ZIPCode_95020 ZIPCode_95023 ZIPCode_95032 ZIPCode_95035 ZIPCode_95037 ZIPCode_95039 ZIPCode_95045 ZIPCode_95051 ZIPCode_95053 ZIPCode_95054 ZIPCode_95060 ZIPCode_95064 ZIPCode_95070 ZIPCode_95112 ZIPCode_95120 ZIPCode_95123 ZIPCode_95125 ZIPCode_95126 ZIPCode_95131 ZIPCode_95133 ZIPCode_95134 ZIPCode_95135 ZIPCode_95136 ZIPCode_95138 ZIPCode_95192 ZIPCode_95193 ZIPCode_95207 ZIPCode_95211 ZIPCode_95307 ZIPCode_95348 ZIPCode_95351 ZIPCode_95354 ZIPCode_95370 ZIPCode_95403 ZIPCode_95405 ZIPCode_95422 ZIPCode_95449 ZIPCode_95482 ZIPCode_95503 ZIPCode_95518 ZIPCode_95521 ZIPCode_95605 ZIPCode_95616 ZIPCode_95617 ZIPCode_95621 ZIPCode_95630 ZIPCode_95670 ZIPCode_95678 ZIPCode_95741 ZIPCode_95747 ZIPCode_95758 ZIPCode_95762 ZIPCode_95812 ZIPCode_95814 ZIPCode_95816 ZIPCode_95817 ZIPCode_95818 ZIPCode_95819 ZIPCode_95820 ZIPCode_95821 ZIPCode_95822 ZIPCode_95825 ZIPCode_95827 ZIPCode_95828 ZIPCode_95831 ZIPCode_95833 ZIPCode_95841 ZIPCode_95842 ZIPCode_95929 ZIPCode_95973 ZIPCode_96001 ZIPCode_96003 ZIPCode_96008 ZIPCode_96064 ZIPCode_96091 ZIPCode_96094 ZIPCode_96145 ZIPCode_96150 ZIPCode_96651
0 25 1 49 4 1 1 0 0 0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 45 19 34 3 1 1 0 0 0 52 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 39 15 11 1 1 0 0 0 0 109 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 35 9 100 1 2 0 0 0 0 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 35 8 45 4 2 0 0 0 1 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 29 3 40 1 3 0 0 1 0 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4996 30 4 15 4 1 0 0 1 0 32 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4997 63 39 24 2 3 0 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4998 65 40 49 3 2 0 0 1 0 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4999 28 4 83 3 1 0 0 1 1 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5000 rows × 478 columns

Split Train and Test Sets 70-30

In [245]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [246]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 478)
Shape of test set :  (1500, 478)
Percentage of classes in training set:
0   0.905
1   0.095
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0   0.901
1   0.099
Name: Personal_Loan, dtype: float64
In [247]:
X_train
Out[247]:
Age Experience Income Family Education Securities_Account CD_Account Online CreditCard CCAvg_Inc% year_Mortgage_Inc% ZIPCode_90005 ZIPCode_90007 ZIPCode_90009 ZIPCode_90011 ZIPCode_90016 ZIPCode_90018 ZIPCode_90019 ZIPCode_90024 ZIPCode_90025 ZIPCode_90027 ZIPCode_90028 ZIPCode_90029 ZIPCode_90032 ZIPCode_90033 ZIPCode_90034 ZIPCode_90035 ZIPCode_90036 ZIPCode_90037 ZIPCode_90041 ZIPCode_90044 ZIPCode_90045 ZIPCode_90048 ZIPCode_90049 ZIPCode_90057 ZIPCode_90058 ZIPCode_90059 ZIPCode_90064 ZIPCode_90065 ZIPCode_90066 ZIPCode_90068 ZIPCode_90071 ZIPCode_90073 ZIPCode_90086 ZIPCode_90089 ZIPCode_90095 ZIPCode_90210 ZIPCode_90212 ZIPCode_90230 ZIPCode_90232 ZIPCode_90245 ZIPCode_90250 ZIPCode_90254 ZIPCode_90266 ZIPCode_90272 ZIPCode_90274 ZIPCode_90275 ZIPCode_90277 ZIPCode_90280 ZIPCode_90291 ZIPCode_90304 ZIPCode_90401 ZIPCode_90404 ZIPCode_90405 ZIPCode_90502 ZIPCode_90503 ZIPCode_90504 ZIPCode_90505 ZIPCode_90509 ZIPCode_90601 ZIPCode_90623 ZIPCode_90630 ZIPCode_90638 ZIPCode_90639 ZIPCode_90640 ZIPCode_90650 ZIPCode_90717 ZIPCode_90720 ZIPCode_90740 ZIPCode_90745 ZIPCode_90747 ZIPCode_90755 ZIPCode_90813 ZIPCode_90840 ZIPCode_91006 ZIPCode_91007 ZIPCode_91016 ZIPCode_91024 ZIPCode_91030 ZIPCode_91040 ZIPCode_91101 ZIPCode_91103 ZIPCode_91105 ZIPCode_91107 ZIPCode_91109 ZIPCode_91116 ZIPCode_91125 ZIPCode_91129 ZIPCode_91203 ZIPCode_91207 ZIPCode_91301 ZIPCode_91302 ZIPCode_91304 ZIPCode_91311 ZIPCode_91320 ZIPCode_91326 ZIPCode_91330 ZIPCode_91335 ZIPCode_91342 ZIPCode_91343 ZIPCode_91345 ZIPCode_91355 ZIPCode_91360 ZIPCode_91361 ZIPCode_91365 ZIPCode_91367 ZIPCode_91380 ZIPCode_91401 ZIPCode_91423 ZIPCode_91604 ZIPCode_91605 ZIPCode_91614 ZIPCode_91706 ZIPCode_91709 ZIPCode_91710 ZIPCode_91711 ZIPCode_91730 ZIPCode_91741 ZIPCode_91745 ZIPCode_91754 ZIPCode_91763 ZIPCode_91765 ZIPCode_91768 ZIPCode_91770 ZIPCode_91773 ZIPCode_91775 ZIPCode_91784 ZIPCode_91791 ZIPCode_91801 ZIPCode_91902 ZIPCode_91910 ZIPCode_91911 ZIPCode_91941 ZIPCode_91942 ZIPCode_91950 ZIPCode_92007 ZIPCode_92008 ZIPCode_92009 ZIPCode_92024 ZIPCode_92028 ZIPCode_92029 ZIPCode_92037 ZIPCode_92038 ZIPCode_92054 ZIPCode_92056 ZIPCode_92064 ZIPCode_92068 ZIPCode_92069 ZIPCode_92084 ZIPCode_92093 ZIPCode_92096 ZIPCode_92101 ZIPCode_92103 ZIPCode_92104 ZIPCode_92106 ZIPCode_92109 ZIPCode_92110 ZIPCode_92115 ZIPCode_92116 ZIPCode_92120 ZIPCode_92121 ZIPCode_92122 ZIPCode_92123 ZIPCode_92124 ZIPCode_92126 ZIPCode_92129 ZIPCode_92130 ZIPCode_92131 ZIPCode_92152 ZIPCode_92154 ZIPCode_92161 ZIPCode_92173 ZIPCode_92177 ZIPCode_92182 ZIPCode_92192 ZIPCode_92220 ZIPCode_92251 ZIPCode_92325 ZIPCode_92333 ZIPCode_92346 ZIPCode_92350 ZIPCode_92354 ZIPCode_92373 ZIPCode_92374 ZIPCode_92399 ZIPCode_92407 ZIPCode_92507 ZIPCode_92518 ZIPCode_92521 ZIPCode_92606 ZIPCode_92612 ZIPCode_92614 ZIPCode_92624 ZIPCode_92626 ZIPCode_92630 ZIPCode_92634 ZIPCode_92646 ZIPCode_92647 ZIPCode_92648 ZIPCode_92653 ZIPCode_92660 ZIPCode_92661 ZIPCode_92672 ZIPCode_92673 ZIPCode_92675 ZIPCode_92677 ZIPCode_92691 ZIPCode_92692 ZIPCode_92694 ZIPCode_92697 ZIPCode_92703 ZIPCode_92704 ZIPCode_92705 ZIPCode_92709 ZIPCode_92717 ZIPCode_92735 ZIPCode_92780 ZIPCode_92806 ZIPCode_92807 ZIPCode_92821 ZIPCode_92831 ZIPCode_92833 ZIPCode_92834 ZIPCode_92835 ZIPCode_92843 ZIPCode_92866 ZIPCode_92867 ZIPCode_92868 ZIPCode_92870 ZIPCode_92886 ZIPCode_93003 ZIPCode_93009 ZIPCode_93010 ZIPCode_93014 ZIPCode_93022 ZIPCode_93023 ZIPCode_93033 ZIPCode_93063 ZIPCode_93065 ZIPCode_93077 ZIPCode_93101 ZIPCode_93105 ZIPCode_93106 ZIPCode_93107 ZIPCode_93108 ZIPCode_93109 ZIPCode_93111 ZIPCode_93117 ZIPCode_93118 ZIPCode_93302 ZIPCode_93305 ZIPCode_93311 ZIPCode_93401 ZIPCode_93403 ZIPCode_93407 ZIPCode_93437 ZIPCode_93460 ZIPCode_93524 ZIPCode_93555 ZIPCode_93561 ZIPCode_93611 ZIPCode_93657 ZIPCode_93711 ZIPCode_93720 ZIPCode_93727 ZIPCode_93907 ZIPCode_93933 ZIPCode_93940 ZIPCode_93943 ZIPCode_93950 ZIPCode_93955 ZIPCode_94002 ZIPCode_94005 ZIPCode_94010 ZIPCode_94015 ZIPCode_94019 ZIPCode_94022 ZIPCode_94024 ZIPCode_94025 ZIPCode_94028 ZIPCode_94035 ZIPCode_94040 ZIPCode_94043 ZIPCode_94061 ZIPCode_94063 ZIPCode_94065 ZIPCode_94066 ZIPCode_94080 ZIPCode_94085 ZIPCode_94086 ZIPCode_94087 ZIPCode_94102 ZIPCode_94104 ZIPCode_94105 ZIPCode_94107 ZIPCode_94108 ZIPCode_94109 ZIPCode_94110 ZIPCode_94111 ZIPCode_94112 ZIPCode_94114 ZIPCode_94115 ZIPCode_94116 ZIPCode_94117 ZIPCode_94118 ZIPCode_94122 ZIPCode_94123 ZIPCode_94124 ZIPCode_94126 ZIPCode_94131 ZIPCode_94132 ZIPCode_94143 ZIPCode_94234 ZIPCode_94301 ZIPCode_94302 ZIPCode_94303 ZIPCode_94304 ZIPCode_94305 ZIPCode_94306 ZIPCode_94309 ZIPCode_94402 ZIPCode_94404 ZIPCode_94501 ZIPCode_94507 ZIPCode_94509 ZIPCode_94521 ZIPCode_94523 ZIPCode_94526 ZIPCode_94534 ZIPCode_94536 ZIPCode_94538 ZIPCode_94539 ZIPCode_94542 ZIPCode_94545 ZIPCode_94546 ZIPCode_94550 ZIPCode_94551 ZIPCode_94553 ZIPCode_94555 ZIPCode_94558 ZIPCode_94566 ZIPCode_94571 ZIPCode_94575 ZIPCode_94577 ZIPCode_94583 ZIPCode_94588 ZIPCode_94590 ZIPCode_94591 ZIPCode_94596 ZIPCode_94598 ZIPCode_94604 ZIPCode_94606 ZIPCode_94607 ZIPCode_94608 ZIPCode_94609 ZIPCode_94610 ZIPCode_94611 ZIPCode_94612 ZIPCode_94618 ZIPCode_94701 ZIPCode_94703 ZIPCode_94704 ZIPCode_94705 ZIPCode_94706 ZIPCode_94707 ZIPCode_94708 ZIPCode_94709 ZIPCode_94710 ZIPCode_94720 ZIPCode_94801 ZIPCode_94803 ZIPCode_94806 ZIPCode_94901 ZIPCode_94904 ZIPCode_94920 ZIPCode_94923 ZIPCode_94928 ZIPCode_94939 ZIPCode_94949 ZIPCode_94960 ZIPCode_94965 ZIPCode_94970 ZIPCode_94998 ZIPCode_95003 ZIPCode_95005 ZIPCode_95006 ZIPCode_95008 ZIPCode_95010 ZIPCode_95014 ZIPCode_95020 ZIPCode_95023 ZIPCode_95032 ZIPCode_95035 ZIPCode_95037 ZIPCode_95039 ZIPCode_95045 ZIPCode_95051 ZIPCode_95053 ZIPCode_95054 ZIPCode_95060 ZIPCode_95064 ZIPCode_95070 ZIPCode_95112 ZIPCode_95120 ZIPCode_95123 ZIPCode_95125 ZIPCode_95126 ZIPCode_95131 ZIPCode_95133 ZIPCode_95134 ZIPCode_95135 ZIPCode_95136 ZIPCode_95138 ZIPCode_95192 ZIPCode_95193 ZIPCode_95207 ZIPCode_95211 ZIPCode_95307 ZIPCode_95348 ZIPCode_95351 ZIPCode_95354 ZIPCode_95370 ZIPCode_95403 ZIPCode_95405 ZIPCode_95422 ZIPCode_95449 ZIPCode_95482 ZIPCode_95503 ZIPCode_95518 ZIPCode_95521 ZIPCode_95605 ZIPCode_95616 ZIPCode_95617 ZIPCode_95621 ZIPCode_95630 ZIPCode_95670 ZIPCode_95678 ZIPCode_95741 ZIPCode_95747 ZIPCode_95758 ZIPCode_95762 ZIPCode_95812 ZIPCode_95814 ZIPCode_95816 ZIPCode_95817 ZIPCode_95818 ZIPCode_95819 ZIPCode_95820 ZIPCode_95821 ZIPCode_95822 ZIPCode_95825 ZIPCode_95827 ZIPCode_95828 ZIPCode_95831 ZIPCode_95833 ZIPCode_95841 ZIPCode_95842 ZIPCode_95929 ZIPCode_95973 ZIPCode_96001 ZIPCode_96003 ZIPCode_96008 ZIPCode_96064 ZIPCode_96091 ZIPCode_96094 ZIPCode_96145 ZIPCode_96150 ZIPCode_96651
1334 47 22 35 2 1 0 0 1 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4768 38 14 39 1 2 0 0 1 0 61 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
65 59 35 131 1 1 0 0 1 1 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
177 29 3 65 4 2 0 0 0 0 33 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4489 39 13 21 3 2 0 0 1 0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2895 60 36 39 4 2 0 0 1 0 40 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2763 55 31 13 4 1 0 0 1 0 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
905 46 22 28 1 1 0 0 1 1 42 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3980 46 22 89 4 2 0 0 1 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
235 38 8 71 4 3 0 0 1 0 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3500 rows × 478 columns

In [248]:
y_train
Out[248]:
1334    0
4768    0
65      0
177     0
4489    0
       ..
2895    0
2763    0
905     0
3980    0
235     0
Name: Personal_Loan, Length: 3500, dtype: int64
In [249]:
y_test
Out[249]:
2764    0
4767    0
3814    0
3499    0
2735    0
       ..
4140    0
3969    0
2535    0
1361    0
1458    0
Name: Personal_Loan, Length: 1500, dtype: int64
In [250]:
# There are different solvers available in Sklearn logistic regression
# The newton-cg solver is faster for high-dimensional data
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(X_train, y_train)
In [251]:
# creating confusion matrix for training sets
confusion_matrix_sklearn_with_threshold(lg, X_train, y_train)

From the Confusion Matrix for the Training sets we determine that the LogisticRegression model resulted in TP=290 (6.57%), TN = 3134 (89.54%), and FP=35 (1%), FN=101 (2.89%).

In [252]:
log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train
)

print("Training performance:")
log_reg_model_train_perf
Training performance:
Out[252]:
Accuracy Recall Precision F1
0 0.961 0.695 0.868 0.772

The model is already at 0.953 accuracy for Training.

In [253]:
# creating confusion matrix for testing sets
confusion_matrix_sklearn_with_threshold(lg, X_test, y_test)

From the Confusion Matrix for the Testing sets we determine that the LogisticRegression model resulted in TP=84 (5.60%), TN = 1334 (88.93%), and FP=17 (1.13%), FN=65 (4.33%).

In [254]:
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test
)

print("Test set performance:")
log_reg_model_test_perf
Test set performance:
Out[254]:
Accuracy Recall Precision F1
0 0.945 0.564 0.832 0.672

Accuracy for the Testing Set is at 0.949, sligtly less than for the Training set.

ROC-AUC The area under the ROC curve (AUC) is a measure of how good a model is - The higher the AUC, the better the model is, at distinguishing between classes.

  1. predict_proba Predicts the probabilities for the class 0 and 1.

Input: Train or test data

Output: Returns the predicted probabilities for class 0 and 1

  1. roc_curve_score Returns the auc scores

Input: 1. Training data 2. Predicted Probability

Output: AUC scores between 0 and 1

  1. roc_curve Returns the fpr, tpr and threshold values which takes the original data and predicted probabilities for the class 1.

Input: 1. Training data 2. Predicted Probability

Output: False positive rate, true positive rate and threshold values

ROC-AUC on training set

In [255]:
# Find the roc auc score for training data
logit_roc_auc_train = roc_auc_score(
    y_train, lg.predict_proba(X_train)[:, 1]
)  # The indexing represents predicted probabilities for class 1
# Find fpr, tpr and threshold values
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
# Plot roc curve
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

ROC-AUC on test set

In [256]:
# Find the roc auc score for testing data
logit_roc_auc_test = roc_auc_score(
    y_test, lg.predict_proba(X_test)[:, 1]
)  # The indexing represents predicted probabilities for class 1
# Find fpr, tpr and threshold values
fpr, tpr, thresholds = roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
# Plot roc curve
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

Model is giving a generalized performance.

Model Performance Improvement¶

Let's see if the f1 score can be improved further, by changing the model threshold using AUC-ROC Curve.

Optimal threshold using AUC-ROC curve

Optimal thresold is the value that best separated the True positive rate and False positive rate.

In [257]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
# roc_curve returns the fpr, tpr and threshold values which takes the original data and predicted probabilities for the class 1.
fpr, tpr, thresholds = roc_curve(
    y_train, lg.predict_proba(X_train)[:, 1]
)  # The indexing represents predicted probabilities for class 1

optimal_idx = np.argmax(
    tpr - fpr
)  # Finds the index that contains the max difference between tpr and fpr
optimal_threshold_auc_roc = thresholds[
    optimal_idx
]  # stores the optimal threshold value
print(optimal_threshold_auc_roc)
0.1462260641989123

Checking model performance on training set

In [258]:
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)

From the Confusion Matrix for the Training sets we determine that the LogisticRegression model resulted in TP=299 (8.54%), TN = 2957 (84.49%), and FP=299 (8.54%), FN=32 (0.91%).

In [259]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
Out[259]:
Accuracy Recall Precision F1
0 0.930 0.903 0.585 0.710

Notice that the accuracy is 0.930 on the training set.

Checking model performance on test set

In [260]:
# creating confusion matrix on test set
confusion_matrix_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)

From the Confusion Matrix for the Testing sets we determine that the LogisticRegression model resulted in TP=122 (8.13%), TN = 1352 (83.47%), and FP=99 (6.60%), FN=27 (1.80%).

In [261]:
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Test set performance:")
log_reg_model_test_perf_threshold_auc_roc
Test set performance:
Out[261]:
Accuracy Recall Precision F1
0 0.916 0.819 0.552 0.659

The precision of the model for both training and test set has improved but the F1 score has reduced.

Let's use Precision-Recall curve and see if we can find a better threshold

The Precision-Recall curve shows the tradeoff between Precision and Recall for different thresholds. It can be used to select optimal threshold as required to improve the model improvement.

precision_recall_curve()

Returns the fpr, tpr and threshold values

Input: 1. Original data 2. Predicted Probability

Output: False positive rate, true positive rate and threshold values

In [262]:
# Find the predicted probabilities for class 1
y_scores = lg.predict_proba(X_train)[:, 1]
# Find fpr, tpr and threshold values
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
# Plot recall precision curve
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()

At the threshold of 0.36, we get balanced recall and precision.

In [263]:
# setting the threshold
optimal_threshold_curve = 0.37

Checking model performance on training set

In [264]:
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_curve
)

From the Confusion Matrix for the Training sets we determine that the LogisticRegression model resulted in TP=253 (25.3%), TN = 3103 (88.66%), and FP=66 (1.89%), FN=78 (2.23%).

In [265]:
log_reg_model_train_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
Out[265]:
Accuracy Recall Precision F1
0 0.959 0.764 0.793 0.778

Checking model performance on test set

In [266]:
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_curve
)

From the Confusion Matrix for the Training sets we determine that the LogisticRegression model resulted in TP=93 (6.20%), TN = 1321 (88.07%), and FP=30 (2%), FN=56 (3.73%).

In [267]:
log_reg_model_test_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_curve
)
print("Test set performance:")
log_reg_model_test_perf_threshold_curve
Test set performance:
Out[267]:
Accuracy Recall Precision F1
0 0.943 0.624 0.756 0.684

Model is performing well on training and test. There's not much improvement in the model performance as the default threshold is 0.50 and here we get 0.37 as the optimal threshold.

Model Performance Summary

In [268]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.15 Threshold",
    "Logistic Regression-0.37 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[268]:
Logistic Regression sklearn Logistic Regression-0.15 Threshold Logistic Regression-0.37 Threshold
Accuracy 0.961 0.930 0.959
Recall 0.695 0.903 0.764
Precision 0.868 0.585 0.793
F1 0.772 0.710 0.778
In [269]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.15 Threshold",
    "Logistic Regression-0.37 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[269]:
Logistic Regression sklearn Logistic Regression-0.15 Threshold Logistic Regression-0.37 Threshold
Accuracy 0.945 0.916 0.943
Recall 0.564 0.819 0.624
Precision 0.832 0.552 0.756
F1 0.672 0.659 0.684

We can use the Logistic Regression-0.37 Threshold which yields the highest F1 score and a slightly smaller accuracy than the Logistic Regression sklearn model.

Logistic Regression-0.15 Threshold yields the largest Recall but Recall isn't a good metric for this case because the data is not balance (between Personal Loans yes and persona Loans No).

Logistic Regression-0.37 Threshold gives F1 0.68 and this is the mode we recommend if using Logistic Regression.

Decision Tree Modelling¶

Use 'gini' criteria to split

In [270]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import sklearn.metrics as metrics
In [271]:
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
Out[271]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)

Decision Tree Performance¶

In [272]:
#scoring
print("Accuracy on training set : ",dTree.score(X_train, y_train))
print("Accuracy on test set : ",dTree.score(X_test, y_test))
Accuracy on training set :  1.0
Accuracy on test set :  0.9746666666666667

Recall should be the metric of the model because Recall gives the ratio of True positives to Actual positives, which means liability customers that actually purchase a Personal Loan product.

In [273]:
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth

    '''
    y_predict = model.predict(X_test)
    cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
                  columns = [i for i in ['Predicted - No','Predicted - Yes']])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [274]:
##  Function to calculate recall score
def get_recall_score(model):
    '''
    model : classifier to predict values of X

    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
    print("Recall on test set : ",metrics.recall_score(y_test,pred_test))

Confusion Matrix¶

In [275]:
make_confusion_matrix(dTree,y_test)
In [276]:
# Recall on train and test
get_recall_score(dTree)
Recall on training set :  1.0
Recall on test set :  0.8322147651006712

Visualization of Decision Tree¶

In [277]:
feature_names = list(X.columns)
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'Education', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'CCAvg_Inc%', 'year_Mortgage_Inc%', 'ZIPCode_90005', 'ZIPCode_90007', 'ZIPCode_90009', 'ZIPCode_90011', 'ZIPCode_90016', 'ZIPCode_90018', 'ZIPCode_90019', 'ZIPCode_90024', 'ZIPCode_90025', 'ZIPCode_90027', 'ZIPCode_90028', 'ZIPCode_90029', 'ZIPCode_90032', 'ZIPCode_90033', 'ZIPCode_90034', 'ZIPCode_90035', 'ZIPCode_90036', 'ZIPCode_90037', 'ZIPCode_90041', 'ZIPCode_90044', 'ZIPCode_90045', 'ZIPCode_90048', 'ZIPCode_90049', 'ZIPCode_90057', 'ZIPCode_90058', 'ZIPCode_90059', 'ZIPCode_90064', 'ZIPCode_90065', 'ZIPCode_90066', 'ZIPCode_90068', 'ZIPCode_90071', 'ZIPCode_90073', 'ZIPCode_90086', 'ZIPCode_90089', 'ZIPCode_90095', 'ZIPCode_90210', 'ZIPCode_90212', 'ZIPCode_90230', 'ZIPCode_90232', 'ZIPCode_90245', 'ZIPCode_90250', 'ZIPCode_90254', 'ZIPCode_90266', 'ZIPCode_90272', 'ZIPCode_90274', 'ZIPCode_90275', 'ZIPCode_90277', 'ZIPCode_90280', 'ZIPCode_90291', 'ZIPCode_90304', 'ZIPCode_90401', 'ZIPCode_90404', 'ZIPCode_90405', 'ZIPCode_90502', 'ZIPCode_90503', 'ZIPCode_90504', 'ZIPCode_90505', 'ZIPCode_90509', 'ZIPCode_90601', 'ZIPCode_90623', 'ZIPCode_90630', 'ZIPCode_90638', 'ZIPCode_90639', 'ZIPCode_90640', 'ZIPCode_90650', 'ZIPCode_90717', 'ZIPCode_90720', 'ZIPCode_90740', 'ZIPCode_90745', 'ZIPCode_90747', 'ZIPCode_90755', 'ZIPCode_90813', 'ZIPCode_90840', 'ZIPCode_91006', 'ZIPCode_91007', 'ZIPCode_91016', 'ZIPCode_91024', 'ZIPCode_91030', 'ZIPCode_91040', 'ZIPCode_91101', 'ZIPCode_91103', 'ZIPCode_91105', 'ZIPCode_91107', 'ZIPCode_91109', 'ZIPCode_91116', 'ZIPCode_91125', 'ZIPCode_91129', 'ZIPCode_91203', 'ZIPCode_91207', 'ZIPCode_91301', 'ZIPCode_91302', 'ZIPCode_91304', 'ZIPCode_91311', 'ZIPCode_91320', 'ZIPCode_91326', 'ZIPCode_91330', 'ZIPCode_91335', 'ZIPCode_91342', 'ZIPCode_91343', 'ZIPCode_91345', 'ZIPCode_91355', 'ZIPCode_91360', 'ZIPCode_91361', 'ZIPCode_91365', 'ZIPCode_91367', 'ZIPCode_91380', 'ZIPCode_91401', 'ZIPCode_91423', 'ZIPCode_91604', 'ZIPCode_91605', 'ZIPCode_91614', 'ZIPCode_91706', 'ZIPCode_91709', 'ZIPCode_91710', 'ZIPCode_91711', 'ZIPCode_91730', 'ZIPCode_91741', 'ZIPCode_91745', 'ZIPCode_91754', 'ZIPCode_91763', 'ZIPCode_91765', 'ZIPCode_91768', 'ZIPCode_91770', 'ZIPCode_91773', 'ZIPCode_91775', 'ZIPCode_91784', 'ZIPCode_91791', 'ZIPCode_91801', 'ZIPCode_91902', 'ZIPCode_91910', 'ZIPCode_91911', 'ZIPCode_91941', 'ZIPCode_91942', 'ZIPCode_91950', 'ZIPCode_92007', 'ZIPCode_92008', 'ZIPCode_92009', 'ZIPCode_92024', 'ZIPCode_92028', 'ZIPCode_92029', 'ZIPCode_92037', 'ZIPCode_92038', 'ZIPCode_92054', 'ZIPCode_92056', 'ZIPCode_92064', 'ZIPCode_92068', 'ZIPCode_92069', 'ZIPCode_92084', 'ZIPCode_92093', 'ZIPCode_92096', 'ZIPCode_92101', 'ZIPCode_92103', 'ZIPCode_92104', 'ZIPCode_92106', 'ZIPCode_92109', 'ZIPCode_92110', 'ZIPCode_92115', 'ZIPCode_92116', 'ZIPCode_92120', 'ZIPCode_92121', 'ZIPCode_92122', 'ZIPCode_92123', 'ZIPCode_92124', 'ZIPCode_92126', 'ZIPCode_92129', 'ZIPCode_92130', 'ZIPCode_92131', 'ZIPCode_92152', 'ZIPCode_92154', 'ZIPCode_92161', 'ZIPCode_92173', 'ZIPCode_92177', 'ZIPCode_92182', 'ZIPCode_92192', 'ZIPCode_92220', 'ZIPCode_92251', 'ZIPCode_92325', 'ZIPCode_92333', 'ZIPCode_92346', 'ZIPCode_92350', 'ZIPCode_92354', 'ZIPCode_92373', 'ZIPCode_92374', 'ZIPCode_92399', 'ZIPCode_92407', 'ZIPCode_92507', 'ZIPCode_92518', 'ZIPCode_92521', 'ZIPCode_92606', 'ZIPCode_92612', 'ZIPCode_92614', 'ZIPCode_92624', 'ZIPCode_92626', 'ZIPCode_92630', 'ZIPCode_92634', 'ZIPCode_92646', 'ZIPCode_92647', 'ZIPCode_92648', 'ZIPCode_92653', 'ZIPCode_92660', 'ZIPCode_92661', 'ZIPCode_92672', 'ZIPCode_92673', 'ZIPCode_92675', 'ZIPCode_92677', 'ZIPCode_92691', 'ZIPCode_92692', 'ZIPCode_92694', 'ZIPCode_92697', 'ZIPCode_92703', 'ZIPCode_92704', 'ZIPCode_92705', 'ZIPCode_92709', 'ZIPCode_92717', 'ZIPCode_92735', 'ZIPCode_92780', 'ZIPCode_92806', 'ZIPCode_92807', 'ZIPCode_92821', 'ZIPCode_92831', 'ZIPCode_92833', 'ZIPCode_92834', 'ZIPCode_92835', 'ZIPCode_92843', 'ZIPCode_92866', 'ZIPCode_92867', 'ZIPCode_92868', 'ZIPCode_92870', 'ZIPCode_92886', 'ZIPCode_93003', 'ZIPCode_93009', 'ZIPCode_93010', 'ZIPCode_93014', 'ZIPCode_93022', 'ZIPCode_93023', 'ZIPCode_93033', 'ZIPCode_93063', 'ZIPCode_93065', 'ZIPCode_93077', 'ZIPCode_93101', 'ZIPCode_93105', 'ZIPCode_93106', 'ZIPCode_93107', 'ZIPCode_93108', 'ZIPCode_93109', 'ZIPCode_93111', 'ZIPCode_93117', 'ZIPCode_93118', 'ZIPCode_93302', 'ZIPCode_93305', 'ZIPCode_93311', 'ZIPCode_93401', 'ZIPCode_93403', 'ZIPCode_93407', 'ZIPCode_93437', 'ZIPCode_93460', 'ZIPCode_93524', 'ZIPCode_93555', 'ZIPCode_93561', 'ZIPCode_93611', 'ZIPCode_93657', 'ZIPCode_93711', 'ZIPCode_93720', 'ZIPCode_93727', 'ZIPCode_93907', 'ZIPCode_93933', 'ZIPCode_93940', 'ZIPCode_93943', 'ZIPCode_93950', 'ZIPCode_93955', 'ZIPCode_94002', 'ZIPCode_94005', 'ZIPCode_94010', 'ZIPCode_94015', 'ZIPCode_94019', 'ZIPCode_94022', 'ZIPCode_94024', 'ZIPCode_94025', 'ZIPCode_94028', 'ZIPCode_94035', 'ZIPCode_94040', 'ZIPCode_94043', 'ZIPCode_94061', 'ZIPCode_94063', 'ZIPCode_94065', 'ZIPCode_94066', 'ZIPCode_94080', 'ZIPCode_94085', 'ZIPCode_94086', 'ZIPCode_94087', 'ZIPCode_94102', 'ZIPCode_94104', 'ZIPCode_94105', 'ZIPCode_94107', 'ZIPCode_94108', 'ZIPCode_94109', 'ZIPCode_94110', 'ZIPCode_94111', 'ZIPCode_94112', 'ZIPCode_94114', 'ZIPCode_94115', 'ZIPCode_94116', 'ZIPCode_94117', 'ZIPCode_94118', 'ZIPCode_94122', 'ZIPCode_94123', 'ZIPCode_94124', 'ZIPCode_94126', 'ZIPCode_94131', 'ZIPCode_94132', 'ZIPCode_94143', 'ZIPCode_94234', 'ZIPCode_94301', 'ZIPCode_94302', 'ZIPCode_94303', 'ZIPCode_94304', 'ZIPCode_94305', 'ZIPCode_94306', 'ZIPCode_94309', 'ZIPCode_94402', 'ZIPCode_94404', 'ZIPCode_94501', 'ZIPCode_94507', 'ZIPCode_94509', 'ZIPCode_94521', 'ZIPCode_94523', 'ZIPCode_94526', 'ZIPCode_94534', 'ZIPCode_94536', 'ZIPCode_94538', 'ZIPCode_94539', 'ZIPCode_94542', 'ZIPCode_94545', 'ZIPCode_94546', 'ZIPCode_94550', 'ZIPCode_94551', 'ZIPCode_94553', 'ZIPCode_94555', 'ZIPCode_94558', 'ZIPCode_94566', 'ZIPCode_94571', 'ZIPCode_94575', 'ZIPCode_94577', 'ZIPCode_94583', 'ZIPCode_94588', 'ZIPCode_94590', 'ZIPCode_94591', 'ZIPCode_94596', 'ZIPCode_94598', 'ZIPCode_94604', 'ZIPCode_94606', 'ZIPCode_94607', 'ZIPCode_94608', 'ZIPCode_94609', 'ZIPCode_94610', 'ZIPCode_94611', 'ZIPCode_94612', 'ZIPCode_94618', 'ZIPCode_94701', 'ZIPCode_94703', 'ZIPCode_94704', 'ZIPCode_94705', 'ZIPCode_94706', 'ZIPCode_94707', 'ZIPCode_94708', 'ZIPCode_94709', 'ZIPCode_94710', 'ZIPCode_94720', 'ZIPCode_94801', 'ZIPCode_94803', 'ZIPCode_94806', 'ZIPCode_94901', 'ZIPCode_94904', 'ZIPCode_94920', 'ZIPCode_94923', 'ZIPCode_94928', 'ZIPCode_94939', 'ZIPCode_94949', 'ZIPCode_94960', 'ZIPCode_94965', 'ZIPCode_94970', 'ZIPCode_94998', 'ZIPCode_95003', 'ZIPCode_95005', 'ZIPCode_95006', 'ZIPCode_95008', 'ZIPCode_95010', 'ZIPCode_95014', 'ZIPCode_95020', 'ZIPCode_95023', 'ZIPCode_95032', 'ZIPCode_95035', 'ZIPCode_95037', 'ZIPCode_95039', 'ZIPCode_95045', 'ZIPCode_95051', 'ZIPCode_95053', 'ZIPCode_95054', 'ZIPCode_95060', 'ZIPCode_95064', 'ZIPCode_95070', 'ZIPCode_95112', 'ZIPCode_95120', 'ZIPCode_95123', 'ZIPCode_95125', 'ZIPCode_95126', 'ZIPCode_95131', 'ZIPCode_95133', 'ZIPCode_95134', 'ZIPCode_95135', 'ZIPCode_95136', 'ZIPCode_95138', 'ZIPCode_95192', 'ZIPCode_95193', 'ZIPCode_95207', 'ZIPCode_95211', 'ZIPCode_95307', 'ZIPCode_95348', 'ZIPCode_95351', 'ZIPCode_95354', 'ZIPCode_95370', 'ZIPCode_95403', 'ZIPCode_95405', 'ZIPCode_95422', 'ZIPCode_95449', 'ZIPCode_95482', 'ZIPCode_95503', 'ZIPCode_95518', 'ZIPCode_95521', 'ZIPCode_95605', 'ZIPCode_95616', 'ZIPCode_95617', 'ZIPCode_95621', 'ZIPCode_95630', 'ZIPCode_95670', 'ZIPCode_95678', 'ZIPCode_95741', 'ZIPCode_95747', 'ZIPCode_95758', 'ZIPCode_95762', 'ZIPCode_95812', 'ZIPCode_95814', 'ZIPCode_95816', 'ZIPCode_95817', 'ZIPCode_95818', 'ZIPCode_95819', 'ZIPCode_95820', 'ZIPCode_95821', 'ZIPCode_95822', 'ZIPCode_95825', 'ZIPCode_95827', 'ZIPCode_95828', 'ZIPCode_95831', 'ZIPCode_95833', 'ZIPCode_95841', 'ZIPCode_95842', 'ZIPCode_95929', 'ZIPCode_95973', 'ZIPCode_96001', 'ZIPCode_96003', 'ZIPCode_96008', 'ZIPCode_96064', 'ZIPCode_96091', 'ZIPCode_96094', 'ZIPCode_96145', 'ZIPCode_96150', 'ZIPCode_96651']
In [278]:
plt.figure(figsize=(20,30))
tree.plot_tree(dTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [279]:
# Text report showing the rules of a decision tree -

print(tree.export_text(dTree,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- Income <= 92.50
|   |   |--- ZIPCode_90601 <= 0.50
|   |   |   |--- Income <= 81.50
|   |   |   |   |--- ZIPCode_91203 <= 0.50
|   |   |   |   |   |--- ZIPCode_94305 <= 0.50
|   |   |   |   |   |   |--- weights: [2198.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_94305 >  0.50
|   |   |   |   |   |   |--- CCAvg_Inc% <= 52.50
|   |   |   |   |   |   |   |--- weights: [45.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg_Inc% >  52.50
|   |   |   |   |   |   |   |--- CCAvg_Inc% <= 53.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg_Inc% >  53.50
|   |   |   |   |   |   |   |   |--- Income <= 71.00
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  71.00
|   |   |   |   |   |   |   |   |   |--- Age <= 43.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Age >  43.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- ZIPCode_91203 >  0.50
|   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Income >  81.50
|   |   |   |   |--- CCAvg_Inc% <= 43.50
|   |   |   |   |   |--- ZIPCode_92121 <= 0.50
|   |   |   |   |   |   |--- weights: [248.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_92121 >  0.50
|   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- CCAvg_Inc% >  43.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |--- Age <= 45.50
|   |   |   |   |   |   |   |   |--- Age <= 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  28.00
|   |   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  45.50
|   |   |   |   |   |   |   |   |--- year_Mortgage_Inc% <= 5.00
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94720 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94720 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- year_Mortgage_Inc% >  5.00
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |--- ZIPCode_94122 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_94122 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |--- ZIPCode_90601 >  0.50
|   |   |   |--- Income <= 71.00
|   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Income >  71.00
|   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |--- Income >  92.50
|   |   |--- CCAvg_Inc% <= 37.50
|   |   |   |--- ZIPCode_93106 <= 0.50
|   |   |   |   |--- ZIPCode_90049 <= 0.50
|   |   |   |   |   |--- ZIPCode_95822 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_94705 <= 0.50
|   |   |   |   |   |   |   |--- ZIPCode_91129 <= 0.50
|   |   |   |   |   |   |   |   |--- ZIPCode_92007 <= 0.50
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94110 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_94305 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_94305 >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94110 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- ZIPCode_92007 >  0.50
|   |   |   |   |   |   |   |   |   |--- Education <= 2.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  2.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_91129 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_94705 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- ZIPCode_95822 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_90049 >  0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- ZIPCode_93106 >  0.50
|   |   |   |   |--- Experience <= 33.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Experience >  33.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |--- CCAvg_Inc% >  37.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Experience <= 5.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Experience >  5.00
|   |   |   |   |   |   |--- ZIPCode_92709 <= 0.50
|   |   |   |   |   |   |   |--- ZIPCode_90034 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_90034 >  0.50
|   |   |   |   |   |   |   |   |--- Age <= 50.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  50.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_92709 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |--- CCAvg_Inc% <= 59.00
|   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |--- CCAvg_Inc% >  59.00
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Education >  1.50
|   |   |   |   |--- Income <= 106.00
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- CCAvg_Inc% <= 51.50
|   |   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |   |--- ZIPCode_90277 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |   |--- ZIPCode_90277 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg_Inc% >  51.50
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- Age <= 62.50
|   |   |   |   |   |   |   |--- weights: [0.00, 11.00] class: 1
|   |   |   |   |   |   |--- Age >  62.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.00
|   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|--- Income >  116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [0.00, 222.00] class: 1

In [280]:
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                Imp
Education     0.403
Income        0.310
Family        0.143
CCAvg_Inc%    0.049
Age           0.020
...             ...
ZIPCode_92110 0.000
ZIPCode_92109 0.000
ZIPCode_92106 0.000
ZIPCode_92104 0.000
ZIPCode_96651 0.000

[478 rows x 1 columns]

According to the decision tree model, Education is the most important variable for predicting if the liaility customer will purchase a Personal Loan product. Notice that from our previous analysis, we determined that Income was the most important feature.

The tree above is very complex, such a tree often overfits. So lets reduce dimensions.

Reducing Over Fitting¶

In [281]:
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)
Out[281]:
DecisionTreeClassifier(max_depth=3, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=3, random_state=1)

Confusion Matrix - decision tree with depth restricted to 3¶

In [282]:
make_confusion_matrix(dTree1, y_test)
In [283]:
# Accuracy on train and test
print("Accuracy on training set : ",dTree1.score(X_train, y_train))
print("Accuracy on test set : ",dTree1.score(X_test, y_test))
# Recall on train and test
get_recall_score(dTree1)
Accuracy on training set :  0.9822857142857143
Accuracy on test set :  0.9753333333333334
Recall on training set :  0.8126888217522659
Recall on test set :  0.7516778523489933

Recall on Training has reduced from 1 to 0.81 which is better because the model overfitting has been reduced.

Visualization of Decision Tree¶

In [284]:
plt.figure(figsize=(15,10))

tree.plot_tree(dTree1,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [285]:
# Text report showing the rules of a decision tree -

print(tree.export_text(dTree1,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- Income <= 92.50
|   |   |--- ZIPCode_90601 <= 0.50
|   |   |   |--- weights: [2550.00, 14.00] class: 0
|   |   |--- ZIPCode_90601 >  0.50
|   |   |   |--- weights: [2.00, 1.00] class: 0
|   |--- Income >  92.50
|   |   |--- CCAvg_Inc% <= 37.50
|   |   |   |--- weights: [207.00, 14.00] class: 0
|   |   |--- CCAvg_Inc% >  37.50
|   |   |   |--- weights: [35.00, 33.00] class: 0
|--- Income >  116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [0.00, 222.00] class: 1

The tree has become readable and the recall on the test set has reduced from 0.83 to 0.75

In [286]:
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print (pd.DataFrame(dTree1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                Imp
Education     0.450
Income        0.348
Family        0.164
CCAvg_Inc%    0.036
ZIPCode_90601 0.001
...             ...
ZIPCode_92093 0.000
ZIPCode_92084 0.000
ZIPCode_92069 0.000
ZIPCode_92068 0.000
ZIPCode_96651 0.000

[478 rows x 1 columns]

With a depth of 3, the model is undefitting. So we are going to find the best values.

Using GridSearch for Hyperparameter tuning of our tree model¶

Notes: Hyperparameter tuning is also tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. i.e we'll use Grid search Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters. It is an exhaustive search that is performed on a the specific parameter values of a model. The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-searc

In [287]:
from sklearn.model_selection import GridSearchCV
In [288]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
## add from article
parameters = {'max_depth': np.arange(1,10),
              'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
              'max_leaf_nodes' : [2, 3, 5, 10],
              'min_impurity_decrease': [0.001,0.01,0.1]
             }

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[288]:
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, min_samples_leaf=7,
                       random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=10,
                       min_impurity_decrease=0.001, min_samples_leaf=7,
                       random_state=1)

Confusion Matrix - decision tree with tuned hyperparameters¶

In [289]:
make_confusion_matrix(estimator,y_test)
In [290]:
# Accuracy on train and test
print("Accuracy on training set : ",estimator.score(X_train, y_train))
print("Accuracy on test set : ",estimator.score(X_test, y_test))
# Recall on train and test
get_recall_score(estimator)
Accuracy on training set :  0.9871428571428571
Accuracy on test set :  0.9786666666666667
Recall on training set :  0.8912386706948641
Recall on test set :  0.8187919463087249

Recall is worse for both train and test set after hyperparameter tuning.

Decision Tree Visualization¶

In [291]:
plt.figure(figsize=(15,10))

tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [292]:
# Text report showing the rules of a decision tree -

print(tree.export_text(estimator,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- Income <= 92.50
|   |   |--- weights: [2552.00, 15.00] class: 0
|   |--- Income >  92.50
|   |   |--- CCAvg_Inc% <= 37.50
|   |   |   |--- weights: [207.00, 14.00] class: 0
|   |   |--- CCAvg_Inc% >  37.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- weights: [26.00, 7.00] class: 0
|   |   |   |--- Education >  1.50
|   |   |   |   |--- weights: [9.00, 26.00] class: 1
|--- Income >  116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [0.00, 222.00] class: 1

In [293]:
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )

print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

#Here we will see that importance of features has increased
                Imp
Education     0.461
Income        0.342
Family        0.161
CCAvg_Inc%    0.036
ZIPCode_94234 0.000
...             ...
ZIPCode_92084 0.000
ZIPCode_92069 0.000
ZIPCode_92068 0.000
ZIPCode_92064 0.000
ZIPCode_96651 0.000

[478 rows x 1 columns]

Post pruning might give better results, since there is quite a good possibility that we might neglect some hyperparameters, post pruning will take care of all that.

Cost Complexity Pruning¶

Notes: The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfiting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Notes: Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

In [294]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [295]:
pd.DataFrame(path)
Out[295]:
ccp_alphas impurities
0 0.000 0.000
1 0.000 0.001
2 0.000 0.002
3 0.000 0.004
4 0.000 0.005
5 0.000 0.005
6 0.000 0.005
7 0.000 0.006
8 0.000 0.007
9 0.000 0.008
10 0.000 0.008
11 0.000 0.008
12 0.000 0.009
13 0.000 0.009
14 0.000 0.010
15 0.001 0.011
16 0.001 0.013
17 0.001 0.014
18 0.001 0.014
19 0.001 0.015
20 0.001 0.019
21 0.001 0.020
22 0.001 0.022
23 0.001 0.023
24 0.003 0.026
25 0.004 0.035
26 0.024 0.059
27 0.056 0.171
In [296]:
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [297]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [298]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Accuracy vs alpha for training and testing sets¶

When ccp_alpha is set to zero and keeping the other default parameters of DecisionTreeClassifier, the tree overfits, leading to a 100% training accuracy and 69% testing accuracy. As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better.

In [299]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [300]:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [301]:
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.0005952380952380953, random_state=1)
Training accuracy of best model:  0.9908571428571429
Test accuracy of best model:  0.9786666666666667

Since accuracy isn't the right metric for our data we would want high recall

In [302]:
recall_train=[]
for clf in clfs:
    pred_train3=clf.predict(X_train)
    values_train=metrics.recall_score(y_train,pred_train3)
    recall_train.append(values_train)
In [303]:
recall_test=[]
for clf in clfs:
    pred_test3=clf.predict(X_test)
    values_test=metrics.recall_score(y_test,pred_test3)
    recall_test.append(values_test)
In [304]:
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [305]:
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0004761904761904762, random_state=1)

Confusion Matrix - post-pruned decision tree¶

In [306]:
make_confusion_matrix(best_model,y_test)
In [307]:
# Recall on train and test
get_recall_score(best_model)
Recall on training set :  0.9577039274924471
Recall on test set :  0.8389261744966443
In [308]:
plt.figure(figsize=(17,15))

tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [309]:
# Text report showing the rules of a decision tree -

print(tree.export_text(best_model,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- Income <= 92.50
|   |   |--- ZIPCode_90601 <= 0.50
|   |   |   |--- Income <= 81.50
|   |   |   |   |--- weights: [2258.00, 3.00] class: 0
|   |   |   |--- Income >  81.50
|   |   |   |   |--- CCAvg_Inc% <= 43.50
|   |   |   |   |   |--- weights: [250.00, 1.00] class: 0
|   |   |   |   |--- CCAvg_Inc% >  43.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |--- Age <= 45.50
|   |   |   |   |   |   |   |   |--- Age <= 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  28.00
|   |   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  45.50
|   |   |   |   |   |   |   |   |--- year_Mortgage_Inc% <= 5.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |   |   |   |   |   |--- year_Mortgage_Inc% >  5.00
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |--- ZIPCode_94122 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_94122 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |--- ZIPCode_90601 >  0.50
|   |   |   |--- weights: [2.00, 1.00] class: 0
|   |--- Income >  92.50
|   |   |--- CCAvg_Inc% <= 37.50
|   |   |   |--- ZIPCode_93106 <= 0.50
|   |   |   |   |--- ZIPCode_90049 <= 0.50
|   |   |   |   |   |--- ZIPCode_95822 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_94705 <= 0.50
|   |   |   |   |   |   |   |--- ZIPCode_91129 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [206.00, 8.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_91129 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_94705 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- ZIPCode_95822 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_90049 >  0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- ZIPCode_93106 >  0.50
|   |   |   |   |--- weights: [1.00, 2.00] class: 1
|   |   |--- CCAvg_Inc% >  37.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Experience <= 5.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Experience >  5.00
|   |   |   |   |   |   |--- ZIPCode_92709 <= 0.50
|   |   |   |   |   |   |   |--- weights: [25.00, 1.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_92709 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |--- weights: [1.00, 4.00] class: 1
|   |   |   |--- Education >  1.50
|   |   |   |   |--- Income <= 106.00
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- CCAvg_Inc% <= 51.50
|   |   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg_Inc% >  51.50
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- Age <= 62.50
|   |   |   |   |   |   |   |--- weights: [0.00, 11.00] class: 1
|   |   |   |   |   |   |--- Age >  62.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.00
|   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|--- Income >  116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [0.00, 222.00] class: 1

In [310]:
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )

print (pd.DataFrame(best_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                Imp
Education     0.422
Income        0.322
Family        0.152
CCAvg_Inc%    0.042
CD_Account    0.018
...             ...
ZIPCode_92096 0.000
ZIPCode_92093 0.000
ZIPCode_92084 0.000
ZIPCode_92069 0.000
ZIPCode_96651 0.000

[478 rows x 1 columns]
In [311]:
comparison_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with restricted maximum depth','Decision treee with hyperparameter tuning',
                                         'Decision tree with post-pruning'], 'Train_Recall':[1,0.81,0.89,0.95], 'Test_Recall':[0.83,0.75,0.81,0.84]})
comparison_frame
Out[311]:
Model Train_Recall Test_Recall
0 Initial decision tree model 1.000 0.830
1 Decision tree with restricted maximum depth 0.810 0.750
2 Decision treee with hyperparameter tuning 0.890 0.810
3 Decision tree with post-pruning 0.950 0.840

Decision Tree with post-pruning gives the highest recall on test set (0.84).

Comparing Logistic Regression and Decision Tree¶

The metric we are using to evaluate the Logistic Regression Model is F1 while we are using Recall to evauate the Decision Tree.

Decision Tree is easier to interpret than Logistic Regression, it also bisects the space into smaller spaces, it automatically handles decision making thresholds.

However, Decision tree is prone to overfitting and noise.

On the other hand, Logistic Regression is sturdy in the sense that is not prone to overifting and robust to noise. However is more complex, thresholds mut be set manually, and is more difficult to intrerpret.

if using Logistic Regression, consider using the Logistic Regression-0.37 Threshold that yields a Test F1 of 0.68.

If using Decision tree, consider using the Decision Tree with Post-Pruning that gives a Test Recall of 0.84.

Which model is better in our case or what model should we use? You can read this in our conclusions.

Conclusions¶

This Logistic Regression model with a threshold of 0.37 can predict with an accuracy of 0.961 on the training set and 0.945 on the testing set, which liability customers will borrow money (Personal Loan) while maintaining other bank Products. Our model has f1 score of 0.672 on the testing and 0.772 on the training sets.

The Decision Tree model w/ Post Pruning yields a recall of train 0.950 and of test 0.840 which are good.

In this particular case we would recommend trying the Decision tree model because not only is it easier to understand, but it can be visualized and its 'automatized' in the sense that since no threshold has to be chosen manually, this type of bias doesn't exist. We tried to minimize the overfitting and improve the model prunning it.

From the analysis, we determined that Income, CD Accounts, and Education are the most important features that influence whether a liability customer will purchase a Personal Loan Product.

  • Higher income customers, tend to purchase loans more than lower income customers. Customers in the Income range of 120K to 190K have the not only the most Personal Loans compared to other income ranges but also the highest density of Personal Loans within their group.

  • Customers with CD Accounts also tend to purchase more Personal Loans.

  • Customers with higher education (Professional/Advanced) tend to purchase more Personal Loans.

Regarding our features that are less important than the 3 above, we have a few observations:

  • The Age group in the 60s, 20s, and 30s, tend to purchase more loans than the other age groups.
  • Family size 4 customers tend to purchase more loans.
  • CC Average, customers that spend an average of 40-50% of their monthly income on a credit card, also tend to purchase more loans.
  • Mortgage customers with monthly values of 5-15% of their monthly income also tend to purchase more loans.
  • Security Accounts customers also tend to purchase more loans than those that do not have one.
  • ZIP Code is a controversial observation. We noticed that ZIP codes in which a large university is located tend to purchase more loans. These customers are not necessarily students, but they may be associated to the schools in those ZIP code areas. In particular ZIP Codes 94720, 94305, 95616, 90095, and 93106 stand out.

Other features that don't seem to be relevant are Experience, Credit Card, and Online access.

Regarding our model, we use Logistic Regression-0.58 Threshold which yields the highest F1 score and if we use Decision Tree with post pruning, a recall of 0.95 (Train) and 0.84 (Test).

Recommendations¶

We dare to make the following recommendations:

Target the following Liability Customers:

  1. Located in ZIP Codes where large or prestigious universities or colleges, including Silicon Valley.
  2. Income over 100K.
  3. With Higher Education (Advanced Degrees/Professional Degrees) first, then with Postgraduate education.

Fine Tune the Customers in the Target Group above (Preferably):

  1. Have a CD_Account.
  2. Have a Mortgage monthly payment is no more than 15% of their monthly income.
  3. Do not have more than 50% of their monthly income on their monthly average credit card expense.
  4. Have 4 or 3 family members.

If looking for new customers, we recommend attracting new customers in the ZIP Codes mentioned above and not in the undegraduate age group but in the age ranges of 30s to 60s.

Thank you.